Install bigWig and latticeExtra package
install.packages("devtools", quiet = TRUE)
## also installing the dependencies 'Rcpp', 'utf8', 'askpass', 'credentials', 'openssl', 'sys', 'zip', 'gitcreds', 'httr2', 'ini', 'httpuv', 'xtable', 'sourcetools', 'later', 'promises', 'fansi', 'systemfonts', 'textshaping', 'pillar', 'pkgconfig', 'diffobj', 'rematch2', 'clipr', 'crayon', 'curl', 'gert', 'gh', 'purrr', 'rprojroot', 'rstudioapi', 'whisker', 'shiny', 'callr', 'processx', 'downlit', 'httr', 'ragg', 'tibble', 'xml2', 'htmlwidgets', 'prettyunits', 'xopen', 'brew', 'commonmark', 'cpp11', 'brio', 'praise', 'ps', 'waldo', 'usethis', 'desc', 'miniUI', 'pkgbuild', 'pkgdown', 'pkgload', 'profvis', 'rcmdcheck', 'remotes', 'roxygen2', 'rversions', 'sessioninfo', 'testthat', 'urlchecker', 'withr'
library(devtools)
## Loading required package: usethis
devtools::install_github('andrelmartins/bigWig',
subdir='bigWig')
## Downloading GitHub repo andrelmartins/bigWig@HEAD
## ── R CMD build ─────────────────────────────────────────────────────────────────
## * checking for file ‘/private/tmp/RtmpCN9BjP/remotes7426b022c8b/andrelmartins-bigWig-beac5a7/bigWig/DESCRIPTION’ ... OK
## * preparing ‘bigWig’:
## * checking DESCRIPTION meta-information ... OK
## * cleaning src
## * checking for LF line-endings in source and make files and shell scripts
## * checking for empty or unneeded directories
## * building ‘bigWig_0.2-9.tar.gz’
library(bigWig)
install.packages("latticeExtra", quiet = TRUE)
## also installing the dependencies 'deldir', 'RcppEigen', 'png', 'jpeg', 'RColorBrewer', 'interp'
install.packages("DESeq2", quiet = TRUE)
## Warning: package 'DESeq2' is not available for this version of R
##
## A version of this package for your version of R might be available elsewhere,
## see the ideas at
## https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages
install.packages("dplyr", quiet = TRUE)
## also installing the dependencies 'generics', 'tidyselect'
Install bedtools
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install bedtools
We begin by using a 101bp window centered around the peak summit and employ MEME/STREME software to identify potential GATA3 binding sites. MEME (classic) identified six GATA3 motifs within the GATA3 ChIP peak set. Subsequent MEME analysis revealed no additional GATA3-matched motifs in peaks without the initial 6 motifs discovered by MEME.
On the other hand, STREME detected two GATA3 motifs in peaks where the 6 MEME-found motifs were absent.
We then expanded the search window from 101bp to 161bp and used MEME and STREME to identify potential binding sites within the full 161bp window and the spanned region only (with the central 81bp swapped to N).
With STREME, we discovered two motifs within the spanned region. These motifs resembled the previously found motifs within the 101bp window. As a result, we concluded that the 101bp window was insufficient for capturing all potential GATA3 binding sites.
Subsequently, we employed MAST to locate peaks containing the previous 8 motifs within the 161bp window. For peaks that didn’t overlap with any motif coordinates, we increased their width from 161bp to 181bp and conducted the same MEME/STREME analysis. However, this time, no GATA3-like motifs were found.
In my GATA3 peak set, GATA3 appears to preferentially bind to the central 161bp window. Despite this, MEME/STREME failed to identify GATA3 binding elements in 38.5% of the peaks.
stack bar plot
a bar plot showing peaks contain GATA-like motifs, and other peaks
wc -l *round*.bed
# 11427 mast_GATA3_PSWM_in_peaks_round1.bed
# 10460 mast_GATA3_PSWM_in_peaks_round2.bed
# 5202 mast_GATA3_PSWM_in_peaks_round3.bed
# 5665 mast_GATA3_PSWM_in_peaks_round4.bed
# 3564 mast_GATA3_PSWM_in_peaks_round5.bed
# 4510 mast_GATA3_PSWM_in_peaks_round6.bed
# 8733 mast_GATA3_PSWM_in_peaks_round7.bed
# 4452 mast_GATA3_PSWM_in_peaks_round8.bed
# 6002 mast_GATA3_PSWM_in_peaks_round9.bed --161bp window
# 60015 total
counts=$(wc -l without_motifs_123456_78_161bp_mast.bed | awk -F"without" '{print $1}') #37308
label1='others'
echo $counts "" $label1 >> bar.csv
total_counts=$(wc -l /home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/peak_call/GATA_ChIP_summit_100window.bed | awk -F"/" '{print $1}') #96868
counts2=$(echo "$total_counts - $counts" | bc)
label2='GATA3-like'
echo $counts2 "" $label2 >> bar.csv
#module load R/4.1.2
#R
library("lattice")
bar=read.csv('bar.csv', sep = "", header=F)
colnames(bar)=c("peak_numbers", "enriched_motifs")
bar$dum.x="peak number"
bar$enriched_motif<- factor(bar$enriched_motif, levels = c("GATA3-like","others"))
pdf('240103_peak_number_with_or_without_GATA3_motif_variant.pdf', width=5,height=6)
my.settings <- list(
#superpose.polygon=list(col=c(colorRampPalette(c("red","pink"))(6),colorRampPalette(c("blue","light blue"))(3), "light grey"), border="transparent"),
superpose.polygon=list(col=c("red","light grey"), border="transparent"),
strip.background=list(col="grey80", cex = 0.6),
strip.border=list(col="black")
)
print(barchart(peak_numbers ~ dum.x,
data = bar,
groups = enriched_motif,
stack = TRUE,
auto.key=list(space="right"),
#scales = list(x = list(rot = 45)),
ylab = "peak set within intensity quantile",
xlab = "number of peaks with/without GATA3 motif variant",
par.settings = my.settings)
)
dev.off()
png('240103_peak_number_with_or_without_GATA3_motif_variant.png')
my.settings <- list(
#superpose.polygon=list(col=c(colorRampPalette(c("red","pink"))(6),colorRampPalette(c("blue","light blue"))(3), "light grey"), border="transparent"),
superpose.polygon=list(col=c("red","light grey"), border="transparent"),
strip.background=list(col="grey80", cex = 0.6),
strip.border=list(col="black")
)
print(barchart(peak_numbers ~ dum.x,
data = bar,
groups = enriched_motif,
stack = TRUE,
auto.key=list(space="right"),
#scales = list(x = list(rot = 45)),
ylab = "peak set within intensity quantile",
xlab = "number of peaks with/without GATA3 motif variant",
par.settings = my.settings)
)
dev.off()
Figure 1: percent peaks with or without GATA3 motif variants
Figure 2: number of peaks with or without GATA3 motif variants
Figure 3: number of peaks with or without GATA3 motif variants
bar plot (group by peak intensity)
The fraction of peaks with MEME-found-motif 123456, as well as STREME-found motif78, in each quantile:
# quantile file is 1bp summit file, expanded to 161window
#module load R/4.1.2
#R
library(bigWig)
for (chip.peak in Sys.glob(file.path("/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/peak_Intensity/GATA_Deseq2/quantile*summits.bed"))) {
print(chip.peak)
quantile.name = strsplit(strsplit(chip.peak, "/")[[1]][length(strsplit(chip.peak, "/")[[1]])], '_summits.bed')[[1]][1]
print(quantile.name)
chip.peaks=read.table(chip.peak, header=FALSE)
chip.peak.161win=center.bed(chip.peaks, upstreamWindow=80, downstreamWindow=80)
write.table(chip.peak.161win,file= paste0(quantile.name, '_summits_161window.bed'), quote=F,sep="\t",col.names=F,row.names=F)
}
module load bedtools
dir="/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/peak_Intensity/GATA_Deseq2/"
# quantile peaks with individual motif
for i in ${dir}*quantile*161window.bed
do
nm=$(echo $i | awk -F $dir '{print $2}' | awk -F "_summits_161window.bed" '{print $1}')
intersectBed -wa -a $i -b ../without_motifs_123456_78_161bp_mast.bed | sort -k1,1 -k2,2n | uniq >${nm}_without_motifs_123456_78_161bp_mast.bed
done
touch quantile.240103.sum.161window.txt
for i in *quantile*_without_motifs*.bed
do
nm=$(echo $i | awk -F"_without_motifs_123456_78_161bp_mast.bed" '{print $1}')
p=$(wc -l $i | awk '{OFS="\t";} {print $1}')
totalp=$(wc -l ${dir}${nm}_summits.bed | awk '{OFS="\t";} {print $1}')
withoutMotif=$(bc <<< "scale=4 ; $p / $totalp")
withMotif=$(bc <<< "scale=4 ; ($totalp- $p) / $totalp")
echo $nm "" $withoutMotif "" $withMotif>> quantile.240103.sum.161window.txt
done
#module load R/4.1.2
#R
library("lattice")
library("reshape2")
df_sum=read.table('quantile.240103.sum.161window.txt', sep = "", header=FALSE)
colnames(df_sum)=c("quantile","without_Motif","with_Motif")
df_sum_long <- df_sum
df_sum_long <- melt(df_sum_long, id = "quantile")
df_sum_long$variable<- factor(df_sum_long$variable, levels = c("with_Motif", "without_Motif"))
pdf('240103_percent_peaks_withorwithout_GATA3_motif_variant.pdf',width=8,height=5)
my.settings <- list(
superpose.polygon=list(col=c("black", "grey"), border="transparent"),
strip.background=list(col="grey80", cex = 0.6),
strip.border=list(col="black")
)
print(barchart(value ~ quantile,
data = df_sum_long,
groups = variable,
stack = TRUE,
auto.key=list(space="right"),
scales = list(x = list(rot = 45)),
ylab = "peak set within intensity quantile",
xlab = "fraction of peaks with/without GATA3 motif variant",
par.settings = my.settings)
)
dev.off()
png('240103_percent_peaks_withorwithout_GATA3_motif_variant.png')
my.settings <- list(
superpose.polygon=list(col=c("black", "grey"), border="transparent"),
strip.background=list(col="grey80", cex = 0.6),
strip.border=list(col="black")
)
print(barchart(value ~ quantile,
data = df_sum_long,
groups = variable,
stack = TRUE,
auto.key=list(space="right"),
scales = list(x = list(rot = 45)),
ylab = "peak set within intensity quantile",
xlab = "fraction of peaks with/without GATA3 motif variant",
par.settings = my.settings)
)
dev.off()
percent peaks with or without GATA3 motif (variant 1,2,3,4,5,6,7,8,9):
Figure 4: percent peaks with or without GATA3 motif variants
For peaks in the top quantile intensity (top 5% highest intensity), we have ~16% peaks MEME/STREME are not able to find GATA3 binding elements.
Stam_MCF-7_1: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM736581
Stam_MCF-7_2: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM736588
IMPORTANT
1) Notice that these data are using hg19 genome, we need to use UCSC liftover to convert data to hg38.
2) While processing the negative control (DHS regions), we want to remove the 8 motifs first, use same window (161bp) and same mast p value.
# 173277 MCF7-DS12619.peaks.fdr0.01.hg19.bed
head -5 MCF7-DS12619.peaks.fdr0.01.hg19.bed
#chr1 10180 10330 . 0 . 12 5.68855 -1 -1
#chr1 16160 16310 . 0 . 11 4.49825 -1 -1
#chr1 237660 237810 . 0 . 41 33.423 -1 -1
#chr1 521440 521590 . 0 . 66 76.4085 -1 -1
#chr1 565560 565710 . 0 . 91 20.8239 -1 -1
#126717 MCF7-DS9445.peaks.fdr0.01.hg19.bed
head -5 MCF7-DS9445.peaks.fdr0.01.hg19.bed
#chr1 237720 237870 . 0 . 11 7.10762 -1 -1
#chr1 521440 521590 . 0 . 16 15.6231 -1 -1
#chr1 565280 565430 . 0 . 41 11.9262 -1 -1
#chr1 565540 565690 . 0 . 54 39.9645 -1 -1
#chr1 565860 566010 . 0 . 95 39.9645 -1 -1
Process the file, keep only the coordinates info. (liftover has format requirements).
awk '{OFS="\t"} {print $1, $2, $3}' MCF7-DS12619.peaks.fdr0.01.hg19.bed > MCF7_hg19_Encode_DHS_Rep2.bed
awk '{OFS="\t"} {print $1, $2, $3}' MCF7-DS9445.peaks.fdr0.01.hg19.bed > MCF7_hg19_Encode_DHS_Rep1.bed
head -5 MCF7_hg19_Encode_DHS_Rep2.bed #173277
#chr1 10180 10330
#chr1 16160 16310
#chr1 237660 237810
#chr1 521440 521590
#chr1 565560 565710
head -5 MCF7_hg19_Encode_DHS_Rep1.bed #126717
#chr1 237720 237870
#chr1 521440 521590
#chr1 565280 565430
#chr1 565540 565690
#chr1 565860 566010
Then we go to the UCSC liftover online tool to convert the hg19 regions to hg38 regions.
https://genome.ucsc.edu/cgi-bin/hgLiftOver
head -5 MCF7_hg38_Encode_DHS_Rep2.bed # 173226
#chr1 10180 10330 chr1:10181-10330 1
#chr1 16160 16310 chr1:16161-16310 1
#chr1 267909 268059 chr1:237661-237810 1
#chr1 586060 586210 chr1:521441-521590 1
#chr1 630180 630330 chr1:565561-565710 1
head -5 MCF7_hg38_Encode_DHS_Rep1.bed # 126690
#chr1 267969 268119 chr1:237721-237870 1
#chr1 586060 586210 chr1:521441-521590 1
#chr1 629900 630050 chr1:565281-565430 1
#chr1 630160 630310 chr1:565541-565690 1
#chr1 630480 630630 chr1:565861-566010 1
Notice that, during liftover process, there are chance to lose some regions.
Remove the original hg19 coordinates info, and keep only the hg38 coordinates.
awk '{OFS="\t"} {print $1, $2, $3}' MCF7_hg38_Encode_DHS_Rep2.bed > MCF7_hg38_Encode_DHS_Rep2_peak.bed
awk '{OFS="\t"} {print $1, $2, $3}' MCF7_hg38_Encode_DHS_Rep1.bed > MCF7_hg38_Encode_DHS_Rep1_peak.bed
wc -l MCF7_hg38_Encode_DHS_Rep2.bed #173226
wc -l MCF7_hg38_Encode_DHS_Rep2_peak.bed #173226
head -5 MCF7_hg38_Encode_DHS_Rep2_peak.bed
#chr1 10180 10330
#chr1 16160 16310
#chr1 267909 268059
#chr1 586060 586210
#chr1 630180 630330
wc -l MCF7_hg38_Encode_DHS_Rep1.bed #126690
wc -l MCF7_hg38_Encode_DHS_Rep1_peak.bed #126690
head -5 MCF7_hg38_Encode_DHS_Rep1_peak.bed
#chr1 267969 268119
#chr1 586060 586210
#chr1 629900 630050
#chr1 630160 630310
#chr1 630480 630630
Make alternative bed files that keep the original intensity info.
Notice that the original file from ENCODE is in bigNarrowPeak format with 10 columns, the 7th column is the signalValue: “Measurement of average enrichment for the region”.
#files contains intensity info with hg19 coordinates
#to match with the converted hg38 file, add 1bp to the start coordinates.
awk '{print $1":"$2+1"-"$3, $7}' MCF7-DS9445.peaks.fdr0.01.hg19.bed > hg19.rep1.intensity.bed
awk '{print $1":"$2+1"-"$3, $7}' MCF7-DS12619.peaks.fdr0.01.hg19.bed > hg19.rep2.intensity.bed
#files contains intensity info with hg19 coordinates
hg19.rep1=read.table('hg19.rep1.intensity.bed', sep = "", header=FALSE) #rep1
hg19.rep2=read.table('hg19.rep2.intensity.bed', sep = "", header=FALSE) #rep2
colnames(hg19.rep1)[1]="hg19"
colnames(hg19.rep2)[1]="hg19"
#files contains intensity info with hg38 coordinates
hg38.rep1=read.table('MCF7_hg38_Encode_DHS_Rep1.bed', sep = "\t", header=FALSE)
hg38.rep2=read.table('MCF7_hg38_Encode_DHS_Rep2.bed', sep = "\t", header=FALSE)
colnames(hg38.rep1)[4]="hg19"
colnames(hg38.rep2)[4]="hg19"
merged.hg38.rep1.intensity <- merge(hg38.rep1, hg19.rep1, by = 'hg19', all.x = TRUE)
merged.hg38.rep2.intensity <- merge(hg38.rep2, hg19.rep2, by = 'hg19', all.x = TRUE)
write.table(merged.hg38.rep1.intensity[, c(2,3,4,6)],file= 'hg38.rep1.intensity.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(merged.hg38.rep2.intensity[, c(2,3,4,6)],file= 'hg38.rep2.intensity.bed', quote=F,sep="\t",col.names=F,row.names=F)
make sure the DHS regions are 161bp window
library(bigWig)
peak.region.summit1=center.bed(read.table('MCF7_hg38_Encode_DHS_Rep1_peak.bed', sep = "\t", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
peak.region.summit2=center.bed(read.table('MCF7_hg38_Encode_DHS_Rep2_peak.bed', sep = "\t", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
peak.region.161bp.1=center.bed(peak.region.summit1, upstreamWindow = 80, downstreamWindow = 80)
peak.region.161bp.2=center.bed(peak.region.summit2, upstreamWindow = 80, downstreamWindow = 80)
head(peak.region.161bp.1)
## V1 V2 V3
## 1 chr1 267964 268125
## 2 chr1 586055 586216
## 3 chr1 629895 630056
## 4 chr1 630155 630316
## 5 chr1 630475 630636
## 6 chr1 631375 631536
head(peak.region.161bp.2)
## V1 V2 V3
## 1 chr1 10175 10336
## 2 chr1 16155 16316
## 3 chr1 267904 268065
## 4 chr1 586055 586216
## 5 chr1 630175 630336
## 6 chr1 630475 630636
write.table(peak.region.161bp.1,file= 'MCF7_hg38_Encode_DHS_Rep1_161bp_peak.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(peak.region.161bp.2,file= 'MCF7_hg38_Encode_DHS_Rep2_161bp_peak.bed', quote=F,sep="\t",col.names=F,row.names=F)
remove GATA3 motifs with MAST
The DHS regions contain all regulatory regions that are sensitive to DNAse. To make a independent negative control, we use MAST to remove regulatory regions that overlap with GATA3-like motif 12345678 (use same p-value stringency).
First convert regulatory region file from .bed to .fasta.
module load bedtools
genome=/home/FCAM/ssun/Genome/hg38.fa
fastaFromBed -fi $genome -bed MCF7_hg38_Encode_DHS_Rep1_161bp_peak.bed -fo MCF7_hg38_Encode_DHS_Rep1_161bp_peak.fasta
fastaFromBed -fi $genome -bed MCF7_hg38_Encode_DHS_Rep2_161bp_peak.bed -fo MCF7_hg38_Encode_DHS_Rep2_161bp_peak.fasta
head -3 MCF7_hg38_Encode_DHS_Rep1_161bp_peak.bed
head -6 MCF7_hg38_Encode_DHS_Rep1_161bp_peak.fasta
head -3 MCF7_hg38_Encode_DHS_Rep2_161bp_peak.bed
head -6 MCF7_hg38_Encode_DHS_Rep2_161bp_peak.fasta
## chr1 267964 268125
## chr1 586055 586216
## chr1 629895 630056
## >chr1:267964-268125
## GGAGACTGATGTGGTTTCTCCTCAGTTTCTCTGTGCAGCACCAGGTGGCAGCAGAGGTCAGCAAGGCAAACCCGAGCCCGGGGATGCGGAGTGGGGGCAGCTACGTCCTCTCTTGAGCTACAGCAGATTCACTCTGTTCTGTTTCATTGTTGTTTAGTTTG
## >chr1:586055-586216
## TTTCCCACATTATTCAGCTTCTGAAAGGGTTGCTTGACCCACAGATGTGAAGCTGAGGCTGAAGGAGACTGATGTGGTTTCTCCTCAGTTTCTCTGTGCGGCACCAGGTGGCAGCAGAGGTCAGCAAGGCAAACCCGAGCCCGGGGATGCGGGGTGGGGGC
## >chr1:629895-630056
## TAACCAATACCACCAATCAATACTCATCATTAATAATCATAATGGCTATAGCAATAAAACTAGGAATAGCCCCCTTTCACTTCTGAGTCCCAGAGGTTACCCAAGGCACCCCTCTGACATCCGGCCTGCTCCTTCTCACATGACAAAAACTAGCCCCCATC
## chr1 10175 10336
## chr1 16155 16316
## chr1 267904 268065
## >chr1:10175-10336
## aacctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaacccctaaccctaaccctaaaccctaaaccctaaccctaaccctaaccctaaccctaaccccaaccccaaccccaaccccaaccccaaccccaaccctaacccctaa
## >chr1:16155-16316
## CAGATACTCCCTGCTTCCTCTCTAGCCCCCACCCTGCAGAGCTGGACCCCTGAGCTAGCCATGCTCTGACAGTCTCAGTTGCACACACGAGCCAGCAGAGGGGTTTTGTGCCACTTCTGGATGCTAGGGTTACACTGGGAGACACAGCAGTGAAGCTGAAA
## >chr1:267904-268065
## CCCACATTATACAGCTTCTGAAAGGGTTGCTTGACCCACAGATGTGAAGCTGAGGCTGAAGGAGACTGATGTGGTTTCTCCTCAGTTTCTCTGTGCAGCACCAGGTGGCAGCAGAGGTCAGCAAGGCAAACCCGAGCCCGGGGATGCGGAGTGGGGGCAGC
We have 8 motifs that found by MEME/STREME software.
ls individual_meme/
## AGATAAM_streme.txt
## GATAmotif1_meme.txt
## GATAmotif2_meme.txt
## GATAmotif3_meme.txt
## GATAmotif4_meme.txt
## GATAmotif5_meme.txt
## GATAmotif6_meme.txt
## TGATAA_streme.txt
MCF7_hg38_Encode_DHS_Rep1_161bp_peak.fasta
MEME (mast uses default p-value: 0.0001)
module load meme/5.4.1
module load R/4.1.2
module load bedtools
genome=/home/FCAM/ssun/Genome/hg38.fa
dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/Exhaustive_MEME_MAST_round1to8/individual_meme/
#round1
mast -hit_list -best ${dir}GATAmotif1_meme.txt MCF7_hg38_Encode_DHS_Rep1_161bp_peak.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round1.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round1.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round1.bed #5836
intersectBed -v -a MCF7_hg38_Encode_DHS_Rep1_161bp_peak.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round1.bed > MCF7DHS_Rep1_161bp_without_motifs_1.bed
wc -l MCF7DHS_Rep1_161bp_without_motifs_1.bed #120854
#round2
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_without_motifs_1.bed -fo MCF7DHS_Rep1_161bp_without_motifs_1.fasta
mast -hit_list -best ${dir}GATAmotif2_meme.txt MCF7DHS_Rep1_161bp_without_motifs_1.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round2.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round2.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round2.bed #4978
intersectBed -v -a MCF7DHS_Rep1_161bp_without_motifs_1.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round2.bed > MCF7DHS_Rep1_161bp_without_motifs_12.bed
wc -l MCF7DHS_Rep1_161bp_without_motifs_12.bed #115876
#round3
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_without_motifs_12.bed -fo MCF7DHS_Rep1_161bp_without_motifs_12.fasta
mast -hit_list -best ${dir}GATAmotif3_meme.txt MCF7DHS_Rep1_161bp_without_motifs_12.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round3.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round3.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round3.bed #2263
intersectBed -v -a MCF7DHS_Rep1_161bp_without_motifs_12.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round3.bed > MCF7DHS_Rep1_161bp_without_motifs_123.bed
wc -l MCF7DHS_Rep1_161bp_without_motifs_123.bed #113613
#round4
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_without_motifs_123.bed -fo MCF7DHS_Rep1_161bp_without_motifs_123.fasta
mast -hit_list -best ${dir}GATAmotif4_meme.txt MCF7DHS_Rep1_161bp_without_motifs_123.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round4.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round4.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round4.bed #3237
intersectBed -v -a MCF7DHS_Rep1_161bp_without_motifs_123.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round4.bed > MCF7DHS_Rep1_161bp_without_motifs_1234.bed
wc -l MCF7DHS_Rep1_161bp_without_motifs_1234.bed #110376
#round5
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_without_motifs_1234.bed -fo MCF7DHS_Rep1_161bp_without_motifs_1234.fasta
mast -hit_list -best ${dir}GATAmotif5_meme.txt MCF7DHS_Rep1_161bp_without_motifs_1234.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round5.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round5.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round5.bed #2629
intersectBed -v -a MCF7DHS_Rep1_161bp_without_motifs_1234.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round5.bed > MCF7DHS_Rep1_161bp_without_motifs_12345.bed
wc -l MCF7DHS_Rep1_161bp_without_motifs_12345.bed #107747
#round6
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_without_motifs_12345.bed -fo MCF7DHS_Rep1_161bp_without_motifs_12345.fasta
mast -hit_list -best ${dir}GATAmotif6_meme.txt MCF7DHS_Rep1_161bp_without_motifs_12345.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round6.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round6.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round6.bed #2800
intersectBed -v -a MCF7DHS_Rep1_161bp_without_motifs_12345.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round6.bed > MCF7DHS_Rep1_161bp_without_motifs_123456.bed
wc -l MCF7DHS_Rep1_161bp_without_motifs_123456.bed #104947
STREME (mast uses p-value of 0.0005)
#round7
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_without_motifs_123456.bed -fo MCF7DHS_Rep1_161bp_without_motifs_123456.fasta
mast -mt 0.0005 -hit_list -best ${dir}AGATAAM_streme.txt MCF7DHS_Rep1_161bp_without_motifs_123456.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round7.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round7.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round7.bed #8176
intersectBed -v -a MCF7DHS_Rep1_161bp_without_motifs_123456.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round7.bed > MCF7DHS_Rep1_161bp_without_motifs_123456_7.bed
wc -l MCF7DHS_Rep1_161bp_without_motifs_123456_7.bed #96771
#round8
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_without_motifs_123456_7.bed -fo MCF7DHS_Rep1_161bp_without_motifs_123456_7.fasta
mast -mt 0.0005 -hit_list -best ${dir}TGATAA_streme.txt MCF7DHS_Rep1_161bp_without_motifs_123456_7.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round8.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round8.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round8.bed #4383
intersectBed -v -a MCF7DHS_Rep1_161bp_without_motifs_123456_7.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round8.bed > MCF7DHS_Rep1_161bp_without_motifs_123456_78.bed
wc -l MCF7DHS_Rep1_161bp_without_motifs_123456_78.bed #92388
MCF7_hg38_Encode_DHS_Rep2_161bp_peak.fasta
MEME (mast uses default p-value: 0.0001)
module load meme/5.4.1
module load R/4.1.2
module load bedtools
genome=/home/FCAM/ssun/Genome/hg38.fa
dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/Exhaustive_MEME_MAST_round1to8/individual_meme/
#round1
mast -hit_list -best ${dir}GATAmotif1_meme.txt MCF7_hg38_Encode_DHS_Rep2_161bp_peak.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round1.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round1.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round1.bed #5249
intersectBed -v -a MCF7_hg38_Encode_DHS_Rep2_161bp_peak.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round1.bed > MCF7DHS_Rep2_161bp_without_motifs_1.bed
wc -l MCF7DHS_Rep2_161bp_without_motifs_1.bed #167977
#round2
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_without_motifs_1.bed -fo MCF7DHS_Rep2_161bp_without_motifs_1.fasta
mast -hit_list -best ${dir}GATAmotif2_meme.txt MCF7DHS_Rep2_161bp_without_motifs_1.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round2.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round2.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round2.bed #4249
intersectBed -v -a MCF7DHS_Rep2_161bp_without_motifs_1.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round2.bed > MCF7DHS_Rep2_161bp_without_motifs_12.bed
wc -l MCF7DHS_Rep2_161bp_without_motifs_12.bed #163728
#round3
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_without_motifs_12.bed -fo MCF7DHS_Rep2_161bp_without_motifs_12.fasta
mast -hit_list -best ${dir}GATAmotif3_meme.txt MCF7DHS_Rep2_161bp_without_motifs_12.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round3.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round3.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round3.bed #1864
intersectBed -v -a MCF7DHS_Rep2_161bp_without_motifs_12.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round3.bed > MCF7DHS_Rep2_161bp_without_motifs_123.bed
wc -l MCF7DHS_Rep2_161bp_without_motifs_123.bed #161864
#round4
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_without_motifs_123.bed -fo MCF7DHS_Rep2_161bp_without_motifs_123.fasta
mast -hit_list -best ${dir}GATAmotif4_meme.txt MCF7DHS_Rep2_161bp_without_motifs_123.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round4.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round4.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round4.bed #3075
intersectBed -v -a MCF7DHS_Rep2_161bp_without_motifs_123.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round4.bed > MCF7DHS_Rep2_161bp_without_motifs_1234.bed
wc -l MCF7DHS_Rep2_161bp_without_motifs_1234.bed #158789
#round5
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_without_motifs_1234.bed -fo MCF7DHS_Rep2_161bp_without_motifs_1234.fasta
mast -hit_list -best ${dir}GATAmotif5_meme.txt MCF7DHS_Rep2_161bp_without_motifs_1234.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round5.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round5.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round5.bed #2510
intersectBed -v -a MCF7DHS_Rep2_161bp_without_motifs_1234.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round5.bed > MCF7DHS_Rep2_161bp_without_motifs_12345.bed
wc -l MCF7DHS_Rep2_161bp_without_motifs_12345.bed #156279
#round6
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_without_motifs_12345.bed -fo MCF7DHS_Rep2_161bp_without_motifs_12345.fasta
mast -hit_list -best ${dir}GATAmotif6_meme.txt MCF7DHS_Rep2_161bp_without_motifs_12345.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round6.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round6.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round6.bed #2689
intersectBed -v -a MCF7DHS_Rep2_161bp_without_motifs_12345.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round6.bed > MCF7DHS_Rep2_161bp_without_motifs_123456.bed
wc -l MCF7DHS_Rep2_161bp_without_motifs_123456.bed #153590
STREME (mast uses p-value of 0.0005)
#round7
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_without_motifs_123456.bed -fo MCF7DHS_Rep2_161bp_without_motifs_123456.fasta
mast -mt 0.0005 -hit_list -best ${dir}AGATAAM_streme.txt MCF7DHS_Rep2_161bp_without_motifs_123456.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round7.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round7.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round7.bed #8783
intersectBed -v -a MCF7DHS_Rep2_161bp_without_motifs_123456.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round7.bed > MCF7DHS_Rep2_161bp_without_motifs_123456_7.bed
wc -l MCF7DHS_Rep2_161bp_without_motifs_123456_7.bed #144807
#round8
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_without_motifs_123456_7.bed -fo MCF7DHS_Rep2_161bp_without_motifs_123456_7.fasta
mast -mt 0.0005 -hit_list -best ${dir}TGATAA_streme.txt MCF7DHS_Rep2_161bp_without_motifs_123456_7.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round8.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round8.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round8.bed #4908
intersectBed -v -a MCF7DHS_Rep2_161bp_without_motifs_123456_7.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round8.bed > MCF7DHS_Rep2_161bp_without_motifs_123456_78.bed
wc -l MCF7DHS_Rep2_161bp_without_motifs_123456_78.bed #139899
We have discovered 8 motifs using MEME/STREME software. Previously, we concatenated them into a single motif matrix database file (“combined_output_meme.txt”). Now, we aim to perform a coherence check by using MAST to scan this concatenated file against the remaining regions.
#rep1
mast -hit_list -best combined_output_meme.txt MCF7DHS_Rep1_161bp_without_motifs_123456_78.bed > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round9.txt
#rep2
mast -hit_list -best combined_output_meme.txt MCF7DHS_Rep2_161bp_without_motifs_123456_78.bed > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round9.txt
Mast the concatenated file against the remaining regions do not give overlapped regions.
To make an independent control, we can remove all overlapped GATA3 ChIP-seq peak regions from the DHS regions. Then we will MAST against the remaining DHS regions, and see what is the random odds we could get GATA3 bindin region.
module load bedtools
sizes=/home/FCAM/ssun/Genome/hg38.chrom.sizes
dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/peak_call/
slopBed -b 80 -i ${dir}GATA_ChIP_summits_final.bed -g $sizes | sort -k1,1 -k2,2n > GATA_ChIP_summit_161window.bed
#remove overlapped GATA3 peak regions from DHS regions
intersectBed -v -a MCF7_hg38_Encode_DHS_Rep1_161bp_peak.bed -b GATA_ChIP_summit_161window.bed > MCF7_hg38_Encode_DHS_Rep1_161bp_noGATA_peak.bed
intersectBed -v -a MCF7_hg38_Encode_DHS_Rep2_161bp_peak.bed -b GATA_ChIP_summit_161window.bed > MCF7_hg38_Encode_DHS_Rep2_161bp_noGATA_peak.bed
wc -l MCF7_hg38_Encode_DHS_Rep1_161bp_peak.bed #126690
wc -l MCF7_hg38_Encode_DHS_Rep1_161bp_noGATA_peak.bed #84362
wc -l MCF7_hg38_Encode_DHS_Rep2_161bp_peak.bed #173226
wc -l MCF7_hg38_Encode_DHS_Rep2_161bp_noGATA_peak.bed #131896
convert to fasta.
module load bedtools
genome=/home/FCAM/ssun/Genome/hg38.fa
fastaFromBed -fi $genome -bed MCF7_hg38_Encode_DHS_Rep1_161bp_noGATA_peak.bed -fo MCF7_hg38_Encode_DHS_Rep1_161bp_noGATA_peak.fasta
fastaFromBed -fi $genome -bed MCF7_hg38_Encode_DHS_Rep2_161bp_noGATA_peak.bed -fo MCF7_hg38_Encode_DHS_Rep2_161bp_noGATA_peak.fasta
head -3 MCF7_hg38_Encode_DHS_Rep1_161bp_noGATA_peak.bed
head -6 MCF7_hg38_Encode_DHS_Rep1_161bp_noGATA_peak.fasta
head -3 MCF7_hg38_Encode_DHS_Rep2_161bp_noGATA_peak.bed
head -6 MCF7_hg38_Encode_DHS_Rep2_161bp_noGATA_peak.fasta
## chr1 267964 268125
## chr1 586055 586216
## chr1 629895 630056
## >chr1:267964-268125
## GGAGACTGATGTGGTTTCTCCTCAGTTTCTCTGTGCAGCACCAGGTGGCAGCAGAGGTCAGCAAGGCAAACCCGAGCCCGGGGATGCGGAGTGGGGGCAGCTACGTCCTCTCTTGAGCTACAGCAGATTCACTCTGTTCTGTTTCATTGTTGTTTAGTTTG
## >chr1:586055-586216
## TTTCCCACATTATTCAGCTTCTGAAAGGGTTGCTTGACCCACAGATGTGAAGCTGAGGCTGAAGGAGACTGATGTGGTTTCTCCTCAGTTTCTCTGTGCGGCACCAGGTGGCAGCAGAGGTCAGCAAGGCAAACCCGAGCCCGGGGATGCGGGGTGGGGGC
## >chr1:629895-630056
## TAACCAATACCACCAATCAATACTCATCATTAATAATCATAATGGCTATAGCAATAAAACTAGGAATAGCCCCCTTTCACTTCTGAGTCCCAGAGGTTACCCAAGGCACCCCTCTGACATCCGGCCTGCTCCTTCTCACATGACAAAAACTAGCCCCCATC
## chr1 10175 10336
## chr1 16155 16316
## chr1 267904 268065
## >chr1:10175-10336
## aacctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaacccctaaccctaaccctaaaccctaaaccctaaccctaaccctaaccctaaccctaaccccaaccccaaccccaaccccaaccccaaccccaaccctaacccctaa
## >chr1:16155-16316
## CAGATACTCCCTGCTTCCTCTCTAGCCCCCACCCTGCAGAGCTGGACCCCTGAGCTAGCCATGCTCTGACAGTCTCAGTTGCACACACGAGCCAGCAGAGGGGTTTTGTGCCACTTCTGGATGCTAGGGTTACACTGGGAGACACAGCAGTGAAGCTGAAA
## >chr1:267904-268065
## CCCACATTATACAGCTTCTGAAAGGGTTGCTTGACCCACAGATGTGAAGCTGAGGCTGAAGGAGACTGATGTGGTTTCTCCTCAGTTTCTCTGTGCAGCACCAGGTGGCAGCAGAGGTCAGCAAGGCAAACCCGAGCCCGGGGATGCGGAGTGGGGGCAGC
MCF7_hg38_Encode_DHS_Rep1_161bp_noGATA_peak.fasta
MEME (mast uses default p-value: 0.0001)
module load meme/5.4.1
module load R/4.1.2
module load bedtools
genome=/home/FCAM/ssun/Genome/hg38.fa
dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/Exhaustive_MEME_MAST_round1to8/individual_meme/
#round1
mast -hit_list -best ${dir}GATAmotif1_meme.txt MCF7_hg38_Encode_DHS_Rep1_161bp_noGATA_peak.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round1.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round1.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round1.bed #1739
intersectBed -v -a MCF7_hg38_Encode_DHS_Rep1_161bp_noGATA_peak.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round1.bed > MCF7DHS_Rep1_161bp_noGATA_without_motifs_1.bed
wc -l MCF7DHS_Rep1_161bp_noGATA_without_motifs_1.bed #82623
#round2
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_noGATA_without_motifs_1.bed -fo MCF7DHS_Rep1_161bp_noGATA_without_motifs_1.fasta
mast -hit_list -best ${dir}GATAmotif2_meme.txt MCF7DHS_Rep1_161bp_noGATA_without_motifs_1.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round2.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round2.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round2.bed #1517
intersectBed -v -a MCF7DHS_Rep1_161bp_noGATA_without_motifs_1.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round2.bed > MCF7DHS_Rep1_161bp_noGATA_without_motifs_12.bed
wc -l MCF7DHS_Rep1_161bp_noGATA_without_motifs_12.bed #81106
#round3
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_noGATA_without_motifs_12.bed -fo MCF7DHS_Rep1_161bp_noGATA_without_motifs_12.fasta
mast -hit_list -best ${dir}GATAmotif3_meme.txt MCF7DHS_Rep1_161bp_noGATA_without_motifs_12.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round3.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round3.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round3.bed #802
intersectBed -v -a MCF7DHS_Rep1_161bp_noGATA_without_motifs_12.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round3.bed > MCF7DHS_Rep1_161bp_noGATA_without_motifs_123.bed
wc -l MCF7DHS_Rep1_161bp_noGATA_without_motifs_123.bed #80304
#round4
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_noGATA_without_motifs_123.bed -fo MCF7DHS_Rep1_161bp_noGATA_without_motifs_123.fasta
mast -hit_list -best ${dir}GATAmotif4_meme.txt MCF7DHS_Rep1_161bp_noGATA_without_motifs_123.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round4.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round4.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round4.bed #1479
intersectBed -v -a MCF7DHS_Rep1_161bp_noGATA_without_motifs_123.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round4.bed > MCF7DHS_Rep1_161bp_noGATA_without_motifs_1234.bed
wc -l MCF7DHS_Rep1_161bp_noGATA_without_motifs_1234.bed #78825
#round5
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_noGATA_without_motifs_1234.bed -fo MCF7DHS_Rep1_161bp_noGATA_without_motifs_1234.fasta
mast -hit_list -best ${dir}GATAmotif5_meme.txt MCF7DHS_Rep1_161bp_noGATA_without_motifs_1234.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round5.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round5.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round5.bed #1219
intersectBed -v -a MCF7DHS_Rep1_161bp_noGATA_without_motifs_1234.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round5.bed > MCF7DHS_Rep1_161bp_noGATA_without_motifs_12345.bed
wc -l MCF7DHS_Rep1_161bp_noGATA_without_motifs_12345.bed #77606
#round6
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_noGATA_without_motifs_12345.bed -fo MCF7DHS_Rep1_161bp_noGATA_without_motifs_12345.fasta
mast -hit_list -best ${dir}GATAmotif6_meme.txt MCF7DHS_Rep1_161bp_noGATA_without_motifs_12345.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round6.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round6.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round6.bed #1239
intersectBed -v -a MCF7DHS_Rep1_161bp_noGATA_without_motifs_12345.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round6.bed > MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456.bed
wc -l MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456.bed #76367
STREME (mast uses p-value of 0.0005)
#round7
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456.bed -fo MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456.fasta
mast -mt 0.0005 -hit_list -best ${dir}AGATAAM_streme.txt MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round7.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round7.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round7.bed #4663
intersectBed -v -a MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round7.bed > MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_7.bed
wc -l MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_7.bed #71704
#round8
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_7.bed -fo MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_7.fasta
mast -mt 0.0005 -hit_list -best ${dir}TGATAA_streme.txt MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_7.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round8.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round8.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round8.bed #2585
intersectBed -v -a MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_7.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round8.bed > MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_78.bed
wc -l MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_78.bed #69119
MCF7_hg38_Encode_DHS_Rep2_161bp_noGATA_peak.fasta
MEME (mast uses default p-value: 0.0001)
module load meme/5.4.1
module load R/4.1.2
module load bedtools
genome=/home/FCAM/ssun/Genome/hg38.fa
dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/Exhaustive_MEME_MAST_round1to8/individual_meme/
#round1
mast -hit_list -best ${dir}GATAmotif1_meme.txt MCF7_hg38_Encode_DHS_Rep2_161bp_noGATA_peak.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round1.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round1.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round1.bed #1976
intersectBed -v -a MCF7_hg38_Encode_DHS_Rep2_161bp_noGATA_peak.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round1.bed > MCF7DHS_Rep2_161bp_noGATA_without_motifs_1.bed
wc -l MCF7DHS_Rep2_161bp_noGATA_without_motifs_1.bed #129920
#round2
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_noGATA_without_motifs_1.bed -fo MCF7DHS_Rep2_161bp_noGATA_without_motifs_1.fasta
mast -hit_list -best ${dir}GATAmotif2_meme.txt MCF7DHS_Rep2_161bp_noGATA_without_motifs_1.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round2.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round2.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round2.bed #1556
intersectBed -v -a MCF7DHS_Rep2_161bp_noGATA_without_motifs_1.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round2.bed > MCF7DHS_Rep2_161bp_noGATA_without_motifs_12.bed
wc -l MCF7DHS_Rep2_161bp_noGATA_without_motifs_12.bed #128364
#round3
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_noGATA_without_motifs_12.bed -fo MCF7DHS_Rep2_161bp_noGATA_without_motifs_12.fasta
mast -hit_list -best ${dir}GATAmotif3_meme.txt MCF7DHS_Rep2_161bp_noGATA_without_motifs_12.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round3.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round3.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round3.bed #818
intersectBed -v -a MCF7DHS_Rep2_161bp_noGATA_without_motifs_12.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round3.bed > MCF7DHS_Rep2_161bp_noGATA_without_motifs_123.bed
wc -l MCF7DHS_Rep2_161bp_noGATA_without_motifs_123.bed #127546
#round4
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_noGATA_without_motifs_123.bed -fo MCF7DHS_Rep2_161bp_noGATA_without_motifs_123.fasta
mast -hit_list -best ${dir}GATAmotif4_meme.txt MCF7DHS_Rep2_161bp_noGATA_without_motifs_123.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round4.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round4.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round4.bed #1688
intersectBed -v -a MCF7DHS_Rep2_161bp_noGATA_without_motifs_123.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round4.bed > MCF7DHS_Rep2_161bp_noGATA_without_motifs_1234.bed
wc -l MCF7DHS_Rep2_161bp_noGATA_without_motifs_1234.bed #125858
#round5
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_noGATA_without_motifs_1234.bed -fo MCF7DHS_Rep2_161bp_noGATA_without_motifs_1234.fasta
mast -hit_list -best ${dir}GATAmotif5_meme.txt MCF7DHS_Rep2_161bp_noGATA_without_motifs_1234.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round5.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round5.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round5.bed #1385
intersectBed -v -a MCF7DHS_Rep2_161bp_noGATA_without_motifs_1234.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round5.bed > MCF7DHS_Rep2_161bp_noGATA_without_motifs_12345.bed
wc -l MCF7DHS_Rep2_161bp_noGATA_without_motifs_12345.bed #124473
#round6
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_noGATA_without_motifs_12345.bed -fo MCF7DHS_Rep2_161bp_noGATA_without_motifs_12345.fasta
mast -hit_list -best ${dir}GATAmotif6_meme.txt MCF7DHS_Rep2_161bp_noGATA_without_motifs_12345.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round6.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round6.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round6.bed #1443
intersectBed -v -a MCF7DHS_Rep2_161bp_noGATA_without_motifs_12345.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round6.bed > MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456.bed
wc -l MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456.bed #123030
STREME (mast uses p-value of 0.0005)
#round7
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456.bed -fo MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456.fasta
mast -mt 0.0005 -hit_list -best ${dir}AGATAAM_streme.txt MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round7.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round7.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round7.bed #5859
intersectBed -v -a MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round7.bed > MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_7.bed
wc -l MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_7.bed #117171
#round8
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_7.bed -fo MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_7.fasta
mast -mt 0.0005 -hit_list -best ${dir}TGATAA_streme.txt MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_7.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round8.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round8.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round8.bed #3397
intersectBed -v -a MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_7.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round8.bed > MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_78.bed
wc -l MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_78.bed #113774
bar plot (group by peak intensity)
The fraction of peaks with MEME-found-motif 123456, as well as STREME-found motif78, in each quantile:
cp /home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/Exhaustive_MEME_MAST_round1to8/bar_plot/quantile.240103.sum.161window.txt .
We will add the two DHS file.
#MCF7DHS_Rep1_161bp_without_motifs_123456_78.bed
#MCF7DHS_Rep2_161bp_without_motifs_123456_78.bed
for i in MCF7DHS*_161bp_without_motifs_123456_78.bed
do
nm=$(echo $i | awk -F"_161bp_without_motifs_123456_78.bed" '{print $1}')
rep=$(echo $i | awk -F"_161bp_without_motifs_123456_78.bed" '{print $1}' | awk -F"MCF7DHS_" '{print $2}')
p=$(wc -l $i | awk '{OFS="\t";} {print $1}')
totalp=$(wc -l MCF7_hg38_Encode_DHS_${rep}_161bp_peak.bed | awk '{OFS="\t";} {print $1}')
withoutMotif=$(bc <<< "scale=4 ; $p / $totalp")
withMotif=$(bc <<< "scale=4 ; ($totalp- $p) / $totalp")
echo $nm "" $withoutMotif "" $withMotif>> quantile.240103.sum.161window.txt
done
#MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_78.bed
#MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_78.bed
for i in MCF7*_161bp_noGATA_without_motifs_123456_78.bed
do
nm=$(echo $i | awk -F"_161bp_noGATA_without_motifs_123456_78.bed" '{print $1}')
rep=$(echo $i | awk -F"_161bp_noGATA_without_motifs_123456_78.bed" '{print $1}' | awk -F"MCF7DHS_" '{print $2}')
p=$(wc -l $i | awk '{OFS="\t";} {print $1}')
totalp=$(wc -l MCF7_hg38_Encode_DHS_${rep}_161bp_noGATA_peak.bed | awk '{OFS="\t";} {print $1}')
withoutMotif=$(bc <<< "scale=4 ; $p / $totalp")
withMotif=$(bc <<< "scale=4 ; ($totalp- $p) / $totalp")
echo "${nm}_noGATA" "" $withoutMotif "" $withMotif>> quantile.240103.sum.161window.txt
done
module load R/4.1.2
R
library("lattice")
library("reshape2")
df_sum=read.table('quantile.240103.sum.161window.txt', sep = "", header=FALSE)
colnames(df_sum)=c("quantile","without_Motif","with_Motif")
df_sum_long <- df_sum
df_sum_long <- melt(df_sum_long, id = "quantile")
df_sum_long$variable<- factor(df_sum_long$variable, levels = c("with_Motif", "without_Motif"))
pdf('240105_percent_peaks_withorwithout_GATA3_motif_variant.pdf',width=8,height=5)
my.settings <- list(
superpose.polygon=list(col=c("black", "grey"), border="transparent"),
strip.background=list(col="grey80", cex = 0.6),
strip.border=list(col="black")
)
print(barchart(value ~ quantile,
data = df_sum_long,
groups = variable,
stack = TRUE,
auto.key=list(space="right"),
scales = list(x = list(rot = 45)),
ylab = "peak set within intensity quantile",
xlab = "fraction of peaks with/without GATA3 motif variant",
par.settings = my.settings)
)
dev.off()
percent peaks with or without GATA3 motif (variant 1,2,3,4,5,6,7,8):
Figure 5: percent peaks with or without GATA3 motif variants
For peaks in the top quantile intensity (top 5% highest intensity), we have ~16% peaks MEME/STREME are not able to find GATA3 binding elements.
We have added full DHS control and an independent DHS control.
Analysis steps:
1) Retrieved MCF7 DHS data (rep1 & rep2) from GSE29692 and used UCSC liftover to convert the data from hg19 to hg38.
2) Modified the DHS regions to 161bp windows.
3) Full DHS control: Applied the same MAST stringency to eliminate regions that overlapped with the 8 MEME/STREME de novo GATA3-like motifs previously identified in GATA3 ChIP peaks.
4) DHS regions removed GATA3 peak regions: use intersectBed to remove overlapped GATA3 ChIP-seq peak regions from the DHS regions, then applied the same MAST stringency to eliminate regions that overlapped with the 8 MEME/STREME de novo GATA3-like motifs previously identified in GATA3 ChIP peaks.
5) Coherence check: Employed MAST on the concatenated motif database against the regions with removed GATA3 motifs —no further regions were found.
6) Counted the regions with and without GATA3 motifs and added them to the bar plot.
Conclusion:
The left four bars are the MCF7 DHS controls. From left to right, they are full rep1 DHS regions, rep1 DHS regions that removed overlapped GATA3 ChIP peak regions, full rep2 DHS regions and rep2 DHS regions that removed overlapped GATA3 ChIPpeak regions.
Among the GATA3 peaks in the lowest intensity quantile (quantile 0.05), over 45% of the peaks contain GATA3 binding elements as identified by MEME/STREME.
The full DHS regions show a 27% overlap with GATA3 motifs in the 1st replicate and a 19% overlap in the 2nd replicate.
The DHS regions removed overlapped GATA3 peak regions has 18% (rep1) and 13.7% (rep2) peaks with GATA3 motifs.
This indicates that randomly we would expect a 14~18% regions has GATA3 motifs found. And GATA3 binding is more specific to regions within the genome (represented by GATA3 ChIP-seq peaks) rather than being uniformly distributed across all accessible DNA regions (represented by the DHS regions).
Consequently, we can infer that even within the GATA3 peaks falling within the lowest quantile, there are significant and meaningful binding sites for GATA3.
Background: we have GATA3 peak that have exhaustively searched for GATA3-like motif with MEME/STREME software. However, there are ~38.5% peak failed in finding a GATA3-like motif. Even for peak subset that have relatively high peak intensity (top5%), we still have ~16% peaks that MEME/STREME cannot find a GATA3-like motif.
Generally, we would expect to find binding sequences within the peak region, so that the ChIPed transcription factor (in our case, GATA3) can bind to. This is even more true for peaks with relatively high intensity.
Our question now became, is MEME/STREME limited in finding binding sequences patterned like GATA3 (that has fixed spacings between two 3mer)? Can we find the 3mer-3mer sequences in GATA3 peaks that do not contain the MEME/STREME found GATA3-like motifs?
There are four DHS negative controls: two reps of full MCF7 DHS regions (refer as “full DHS control”), two reps of MCF7 DHS regions without GATA3 ChIP peak regions (From here, I will refer this sets of control as “independent DHS control”).
All DHS negative controls are removed of the 8 GATA3-like motifs, because we want to compare them to the GATA3 peaks that MEME/STREME failed to find GATA3-like de novo motifs.
MCF7DHS_Rep1_161bp_without_motifs_123456_78.bed #92388
MCF7DHS_Rep2_161bp_without_motifs_123456_78.bed #139899
MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_78.bed #69119
MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_78.bed #113774
There are 8 GATA3-like motifs found by MEME/STREME. Among them, 5 of them have GATA3 binding motifs that are formed by two GAT (or ATC) 3mer and various spacings.
I will utilize MAST against the 161bp window peaks for all 8 motifs (the concatenated database file) to identify motif coordinates corresponding to each motif. Subsequently, I’ll employ intersectBed to isolate the peak sets containing the best motif within the 161bp window. The peak sets characterized by the presence of the specific motif structures that we are interested in will serve as positive controls.
#GATA_ChIP_summit_161window.bed #the 161bp window around peak summit from the full GATA3 peak
module load meme/5.4.1
module load R/4.1.2
module load bedtools
genome=/home/FCAM/ssun/Genome/hg38.fa
dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/peak_call/
dir2=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/Exhaustive_MEME_MAST_round1to8/
fastaFromBed -fi $genome -bed ${dir}GATA_ChIP_summit_161window.bed -fo GATA_ChIP_summit_161window.fasta
mast -hit_list -best ${dir2}combined_output_meme.txt GATA_ChIP_summit_161window.fasta > mast_GATA3_PSWM_in_peaks_161win.txt
wget https://raw.githubusercontent.com/sysunn/siyu_daily_update/main/December_2023/parse_multi_mast_to_coordinates.R
Rscript parse_multi_mast_to_coordinates.R mast_GATA3_PSWM_in_peaks_161win.txt 8
wc -l mast_GATA3_PSWM_in_peaks_161win.bed
head -3 mast_GATA3_PSWM_in_peaks_161win.bed
tail -3 mast_GATA3_PSWM_in_peaks_161win.bed
awk '{print > ("mast_GATA3_PSWM_in_peak_161win_motif_" $7 ".bed")}' mast_GATA3_PSWM_in_peaks_161win.bed
ls mast_GATA3_PSWM_in_peak_161win_motif*.bed
## 63124 mast_GATA3_PSWM_in_peaks_161win.bed
## chr1 1505049 1505056 1230.47 6.56e-05 + 1
## chr1 1883638 1883645 1230.47 6.56e-05 + 1
## chr1 5603508 5603515 1312.58 2.16e-05 - 1
## chrX 155296326 155296337 1713.86 1.94e-06 - 8
## chrY 786683 786694 1635.44 4.15e-06 - 8
## chrY 4437398 4437409 1439.4 1.66e-05 - 8
## mast_GATA3_PSWM_in_peak_161win_motif_1.bed
## mast_GATA3_PSWM_in_peak_161win_motif_2.bed
## mast_GATA3_PSWM_in_peak_161win_motif_3.bed
## mast_GATA3_PSWM_in_peak_161win_motif_4.bed
## mast_GATA3_PSWM_in_peak_161win_motif_5.bed
## mast_GATA3_PSWM_in_peak_161win_motif_7.bed
## mast_GATA3_PSWM_in_peak_161win_motif_8.bed
Notice that the motif indices here is based on the PWM order in the concatenated database file:
1: MOTIF AGATAARR –MEME-round3 motif3
2: MOTIF WGATBDHRVAGATAA –MEME-round6 motif6
3: MOTIF BTTATCWGATB –MEME-round5 motif5
4: MOTIF 4-AGATAAM. –STREME motif 1
5: MOTIF AGATNDWNAGATARN. –MEME-round4 motif4
6: MOTIF 6-TGATAA. –STREME motif 2
7: MOTIF WGATBTTATCW –MEME-round1 motif1
8: MOTIF WGATAARVATCW –MEME-round2 motif2
We don’t have peaks assigned with motif indice==6, that means previously we have a set of peak assigned with STREME motif2 (motif8) are 1) no longer defined as contain motifs, or 2) now assigned with other motifs.
When using mast to assess all 8 motifs concurrently, the -best option selectively assigns each peak region with only the most significant motif hit from the list of motif sites.
Previously, STREME identified a set of peaks (n=4452) containing motif8 (STREME motif2), which are now either defined as not containing motifs or associated with motifs other than motif8. Illustrating the fraction of peaks assigned with alternative motifs in this peak set via a bar chart would offer a visual representation of this transition.
module load meme/5.4.1
module load R/4.1.2
module load bedtools
dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/Exhaustive_MEME_MAST_round1to8/
#101 bp window
intersectBed -wa -a ${dir}without_motifs_123456_7.bed -b ${dir}mast_GATA3_PSWM_in_peaks_round8.bed |uniq > peak_101win_with_motif8.bed #4452
head peak_101win_with_motif8.bed
#chr1 1775793 1775894 GATA_ChIP_peak_72 24.5459
#chr1 1855841 1855942 GATA_ChIP_peak_74 3.19627
#chr1 1883543 1883644 GATA_ChIP_peak_76 36.5894
#chr1 2135560 2135661 GATA_ChIP_peak_79 18.7357
#chr1 3613537 3613638 GATA_ChIP_peak_90 27.5967
#161bp window
sizes=/home/FCAM/ssun/Genome/hg38.chrom.sizes
slopBed -b 30 -i peak_101win_with_motif8.bed -g $sizes | sort -k1,1 -k2,2n > peak_161win_with_motif8.bed #4452
When we mast with motif database, we have generated a bed file “mast_GATA3_PSWM_in_peaks_161win.bed” contains motif coordinates info. The 7th column is the motif indices.
head -5 mast_GATA3_PSWM_in_peaks_161win.bed #63124
#chr1 1505049 1505056 1230.47 6.56e-05 + 1
#chr1 1883638 1883645 1230.47 6.56e-05 + 1
#chr1 5603508 5603515 1312.58 2.16e-05 - 1
#chr1 6795903 6795910 1312.58 2.16e-05 + 1
#chr1 7549563 7549570 1230.47 6.56e-05 + 1
We can use intersectBed to intersect these motif coordinates with the full GATA3 peaks (161window) and get all peaks that contains GATA3-like motifs.
Notice that there are peaks containing more than 1 motifs.
intersectBed -wa -wb -a ${dir}GATA_ChIP_summit_161window.bed -b mast_GATA3_PSWM_in_peaks_161win.bed | sort -k1,1 -k2,2n | uniq | head -5
#chr1 869417 869578 GATA_ChIP_peak_30 3.78065 chr1 869461 869467 1284.91 9.58e-05 - 4
#chr1 869417 869578 GATA_ChIP_peak_30 3.78065 chr1 869468 869482 1223.57 5.03e-05 - 5
#chr1 917454 917615 GATA_ChIP_peak_32 75.0238 chr1 917504 917518 1111.37 7.37e-05 + 5
#chr1 917454 917615 GATA_ChIP_peak_32 75.0238 chr1 917551 917565 1076.4 8.4e-05 + 2
#chr1 996069 996230 GATA_ChIP_peak_35 38.3686 chr1 996149 996160 1419.8 1.97e-05 + 8
We can calculate the number of non-duplicated peaks that contain at least one motif from the motif database (all 8 motifs) defined by mast use a stringency of 0.0001.
intersectBed -wa -a ${dir}GATA_ChIP_summit_161window.bed -b mast_GATA3_PSWM_in_peaks_161win.bed | sort -k1,1 -k2,2n | awk '{OFS="\t"} {print $4}' | uniq | wc -l #51122
Recall that in the Exhaustive MEME/STREME analysis, each round we mast with single motif PMW against peaks. And for the last two round of STREME, we have loosen the mast stringency to 0.005.
Therefore, we end up have more peaks defined as having motifs (n=59560) compare to here when we applied a global mast strincy of 0.0001 (n=51122).
By simple math, there are 8438 peaks previously defined as containing motifs now are no longer having motifs.
We can get this peak set:
#new set of peak do not contain motifs
intersectBed -v -a ${dir}GATA_ChIP_summit_161window.bed -b mast_GATA3_PSWM_in_peaks_161win.bed | sort -k1,1 -k2,2n | uniq > without_motifs_all_161bp_mast_new.bed #45746
head -5 without_motifs_all_161bp_mast_new.bed
#chr1 827300 827461 GATA_ChIP_peak_28 9.30478
#chr1 845716 845877 GATA_ChIP_peak_29 541.83
#chr1 916689 916850 GATA_ChIP_peak_31 7.79887
#chr1 924773 924934 GATA_ChIP_peak_33 3.78065
#chr1 966573 966734 GATA_ChIP_peak_34 3.78065
#old set of peak do not contain motifs
dir2=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/Exhaustive_MEME_MAST_round1to8/
head -5 ${dir2}without_motifs_123456_78_161bp_mast.bed #37308 peaks previously defined as not containing motif
#chr1 827300 827461 GATA_ChIP_peak_28 9.30478
#chr1 916689 916850 GATA_ChIP_peak_31 7.79887
#chr1 924773 924934 GATA_ChIP_peak_33 3.78065
#chr1 966573 966734 GATA_ChIP_peak_34 3.78065
#chr1 999428 999589 GATA_ChIP_peak_36 2.11515
#old set of peak contain motifs
intersectBed -v -f 1 -a ${dir}GATA_ChIP_summit_161window.bed -b ${dir2}without_motifs_123456_78_161bp_mast.bed > with_motifs_all_161bp_mast_old.bed #59560 (==96868-37308)
# get peaks that previously contain motifs now no longer contain motifs by mast with db
awk 'FNR==NR{peaks[$4]; next} $4 in peaks' with_motifs_all_161bp_mast_old.bed without_motifs_all_161bp_mast_new.bed > peak_no_longer_contain_motifs.bed #8439
head -5 peak_no_longer_contain_motifs.bed
#chr1 845716 845877 GATA_ChIP_peak_29 541.83
#chr1 1746867 1747028 GATA_ChIP_peak_70 4.46672
#chr1 1775763 1775924 GATA_ChIP_peak_72 24.5459
#chr1 1855811 1855972 GATA_ChIP_peak_74 3.19627
#chr1 2135530 2135691 GATA_ChIP_peak_79 18.7357
For our previous defined peak set that contain motif8 (STREME motif2) (n=4452):
we expect 1) some peak no longer contain motifs:
intersectBed -wa -wb -f 1 -F 1 -a peak_no_longer_contain_motifs.bed -b peak_161win_with_motif8.bed > peak_161win_with_motif8_no_longer_contain_motifs.bed #3528
head -5 peak_161win_with_motif8_no_longer_contain_motifs.bed
#chr1 1775763 1775924 GATA_ChIP_peak_72 24.5459 chr1 1775763 1775924 GATA_ChIP_peak_72 24.5459
#chr1 1855811 1855972 GATA_ChIP_peak_74 3.19627 chr1 1855811 1855972 GATA_ChIP_peak_74 3.19627
#chr1 2135530 2135691 GATA_ChIP_peak_79 18.7357 chr1 2135530 2135691 GATA_ChIP_peak_79 18.7357
#chr1 3613507 3613668 GATA_ChIP_peak_90 27.5967 chr1 3613507 3613668 GATA_ChIP_peak_90 27.5967
#chr1 6359876 6360037 GATA_ChIP_peak_119 32.8724 chr1 6359876 6360037 GATA_ChIP_peak_119 32.8724
${dir2}without_motifs_123456_78_161bp_mast.bed | uniq > peak_no_longer_contain_motifs.bed #8416
head -5 peak_no_longer_contain_motifs.bed
#chr1 845716 845877 GATA_ChIP_peak_29 541.83
#chr1 1746867 1747028 GATA_ChIP_peak_70 4.46672
#chr1 1775763 1775924 GATA_ChIP_peak_72 24.5459
#chr1 1855811 1855972 GATA_ChIP_peak_74 3.19627
#chr1 2135530 2135691 GATA_ChIP_peak_79 18.7357
intersectBed -wa -wb -a peak_no_longer_contain_motifs.bed -b peak_161win_with_motif8.bed | uniq > peak_161win_with_motif8_no_longer_contain_motifs.bed #3525
#coherence check
#there are peaks assigned with more than 1 motif
awk '{OFS="\t"} {print $4}' peak_161win_with_motif8_no_longer_contain_motifs.bed | uniq | wc -l #3524
Previously we have defined 4452 peaks contain motif8, now we have 3528 peaks no longer defined as containing motifs.
To extract peak and motif information: we can use the peak indices and awk to filter the rows in GATA3_peak_161win_with_all_mast_motif.bed based on whether their peak index (in the fourth column) is present in peak_161win_with_motif8.bed (peaks that previously found to contain motif8).
# all peaks that contain newly assigned motifs by mast to db
dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/peak_call/
intersectBed -wa -wb -a ${dir}GATA_ChIP_summit_161window.bed -b mast_GATA3_PSWM_in_peaks_161win.bed | awk '{OFS="\t"} {print $1, $2, $3, $4, $12}' | uniq > GATA3_peak_161win_with_all_mast_motif.bed #63127
awk 'FNR==NR{peaks[$4]; next} $4 in peaks' peak_161win_with_motif8.bed GATA3_peak_161win_with_all_mast_motif.bed > filtered_GATA3_peak_161win_with_all_mast_motif.bed #997
head -5 filtered_GATA3_peak_161win_with_all_mast_motif.bed
#chr1 1883513 1883674 GATA_ChIP_peak_76 1
#chr1 3831729 3831890 GATA_ChIP_peak_96 8
#chr1 4597226 4597387 GATA_ChIP_peak_100 3
#chr1 8875638 8875799 GATA_ChIP_peak_226 7
#chr1 10277773 10277934 GATA_ChIP_peak_290 5
#coherence check
#there are peaks assigned with more than 1 motif
awk '{OFS="\t"} {print $4}' filtered_GATA3_peak_161win_with_all_mast_motif.bed | uniq | wc -l #924
There are 924 unique peaks previously defined as containing motif8 (STREME motif2) now assigned with other motif(s) by mast.
#module load R/4.1.2
#R
library(lattice)
bar=read.table(file = "bar.txt", header = T)
bar$dum.x="peak fraction"
bar$supp=factor(bar$supp, levels=c("new_motif","no_motif"))
pdf('240111_peaks_with_GATA3_motif8_change_by_new_mast.pdf', width=5,height=6)
my.settings <- list(
#superpose.polygon=list(col=c(colorRampPalette(c("red","pink"))(6),colorRampPalette(c("blue","light blue"))(3), "light grey"), border="transparent"),
superpose.polygon=list(col=c("blue","lightgrey"), border="transparent"),
strip.background=list(col="grey80", cex = 0.6),
strip.border=list(col="black")
)
print(barchart(fraction ~ dum.x,
data = bar,
groups = supp,
stack = TRUE,
ylim=c(0,1),
auto.key=list(space="right"),
#scales = list(x = list(rot = 45)),
ylab = "peak set previously contain motif8",
xlab = "fration of peaks changed when mast with motif db",
par.settings = my.settings)
)
dev.off()
Figure 6: peak_with_motif8 either no longer defined as containing motifs or assigned with new motifs when mast with database
# Read the BED file
bed_data <- read.table("filtered_GATA3_peak_161win_with_all_mast_motif.bed", sep="\t", header=FALSE, stringsAsFactors=FALSE)
# Aggregate motif indices for each peak
aggregated_motifs <- aggregate(bed_data$V5, by = list(bed_data$V4), FUN = function(x) paste(unique(x), collapse = ","))
aggregated_motifs$nmotif <- sapply(aggregated_motifs$x, function(x) length(unlist(strsplit(x, ","))))
colnames(aggregated_motifs)=c("peak", "motif_indices", "nmotif")
head(aggregated_motifs)
# peak motif_indices nmotif
#1 GATA_ChIP_peak_100 3 1
#2 GATA_ChIP_peak_10048 1 1
#3 GATA_ChIP_peak_10198b 7,3 2
#4 GATA_ChIP_peak_10245 8 1
#5 GATA_ChIP_peak_10360 7 1
#6 GATA_ChIP_peak_10760a 2 1
nrow(aggregated_motifs)
#[1] 924
length(unique(aggregated_motifs$peak))
#[1] 924
str(aggregated_motifs)
#'data.frame': 924 obs. of 3 variables:
# $ peak : chr "GATA_ChIP_peak_100" "GATA_ChIP_peak_10048" "GATA_ChIP_peak_10198b" "GATA_ChIP_peak_10245" ...
# $ motif_indices: chr "3" "1" "7,3" "8" ...
# $ nmotif : int 1 1 2 1 1 1 2 1 1 1 ...
library(lattice)
group_counts <- as.data.frame(table(aggregated_motifs$motif_indices))
names(group_counts) <- c('Group', 'Count')
summary(group_counts$Count)
#Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1 1 2 22 4 193
group_counts$Group=factor(group_counts$Group, levels=c( "1", "2", "3", "4", "5", "7", "8", "1,2", "1,4", "1,5", "1,8", "2,1", "2,3",
"2,7", "2,8", "3,1", "3,2", "3,4", "3,7", "3,8", "4,2","4,3", "4,5", "4,8", "5,1","5,2", "5,3","5,4", "5,7", "5,8", "7,1", "7,2","7,3", "7,4", "7,5", "7,8", "8,1", "8,2", "8,3", "8,4", "8,5", "8,7" ))
# Create the bar plot
pdf('240110_peaks_with_GATA3_motif_variant.pdf',width=15,height=10)
#custom_labels <- seq(1, 4000, by = 20)
print(barchart(Count ~ Group,
data= group_counts,
col = "skyblue",
ylim=c(0, 200),
xlab = "motif_indices",
ylab = "Number of Peaks",
main = "Number of Peaks contain the motif",
scales = list(x = list(rot = 45)), #y = list(at = custom_labels)
horizontal = FALSE)
)
dev.off()
Figure 7: peak_with_motif6 assigned with new motifs when mast with database
The peak sets that used to defined as containing motif8 when mast single motif (motif8) against peaks without motif12345, are now assigned with other motifs when mast against entire GATA3 peak sets with all 8 motifs.
Summary
Peaks that were previously identified to contain motif8 through mast with a single motif PWM are no longer associated with motif8 when mast is applied to the motif database encompassing all eight motifs. This shift arises from two primary factors.
Firstly, some peaks are now unassigned to any motif, attributed to the heightened stringency (0.0001) in the current analysis, as opposed to the less stringent criterion (0.0005) employed in the previous round.
Conversely, certain peaks are now affiliated with alternative motifs. This shift can be attributed to the improved statistical calculations facilitated by mast using the -best option, enhancing motif assignments beyond the singular focus on motif8.
Prepare and upload these files to UCSC Genome browser:
a BED for peak (subset, 924 peaks previously enriched with motif8 now have assigned with other motifs) coordinates (101bp or 161bp).
peak_101win_with_motif8.bed #4452
peak_161win_with_motif8.bed #4452
awk '{OFS="\t"} {print $1, $2, $3, $4}' filtered_GATA3_peak_161win_with_all_mast_motif.bed | uniq > peak_161win_with_motif8_with_new_motifs.bed #924
head -5 peak_161win_with_motif8_with_new_motifs.bed #-161bp
#chr1 1883513 1883674 GATA_ChIP_peak_76
#chr1 3831729 3831890 GATA_ChIP_peak_96
#chr1 4597226 4597387 GATA_ChIP_peak_100
#chr1 8875638 8875799 GATA_ChIP_peak_226
#chr1 10277773 10277934 GATA_ChIP_peak_290
a BED for each newly assigned motif coordinates in the 161/101 window in the browser.
# newly assigned motif coordinates for all peaks
7431 mast_GATA3_PSWM_in_peak_161win_motif_1.bed
9352 mast_GATA3_PSWM_in_peak_161win_motif_2.bed
6702 mast_GATA3_PSWM_in_peak_161win_motif_3.bed
4815 mast_GATA3_PSWM_in_peak_161win_motif_4.bed
9180 mast_GATA3_PSWM_in_peak_161win_motif_5.bed
13271 mast_GATA3_PSWM_in_peak_161win_motif_7.bed
12373 mast_GATA3_PSWM_in_peak_161win_motif_8.bed
#intersect with the 924 peaks previously contain motif8
intersectBed -wa -wb -a mast_GATA3_PSWM_in_peak_161win_motif_1.bed -b peak_161win_with_motif8_with_new_motifs.bed > motif_1_PSWM_in_peak_161win_previous_with_motif8.bed #104
intersectBed -wa -wb -a mast_GATA3_PSWM_in_peak_161win_motif_2.bed -b peak_161win_with_motif8_with_new_motifs.bed > motif_2_PSWM_in_peak_161win_previous_with_motif8.bed #173
intersectBed -wa -wb -a mast_GATA3_PSWM_in_peak_161win_motif_3.bed -b peak_161win_with_motif8_with_new_motifs.bed > motif_3_PSWM_in_peak_161win_previous_with_motif8.bed #104
intersectBed -wa -wb -a mast_GATA3_PSWM_in_peak_161win_motif_4.bed -b peak_161win_with_motif8_with_new_motifs.bed > motif_4_PSWM_in_peak_161win_previous_with_motif8.bed #67
intersectBed -wa -wb -a mast_GATA3_PSWM_in_peak_161win_motif_5.bed -b peak_161win_with_motif8_with_new_motifs.bed > motif_5_PSWM_in_peak_161win_previous_with_motif8.bed #130
intersectBed -wa -wb -a mast_GATA3_PSWM_in_peak_161win_motif_7.bed -b peak_161win_with_motif8_with_new_motifs.bed > motif_7_PSWM_in_peak_161win_previous_with_motif8.bed #196
intersectBed -wa -wb -a mast_GATA3_PSWM_in_peak_161win_motif_8.bed -b peak_161win_with_motif8_with_new_motifs.bed > motif_8_PSWM_in_peak_161win_previous_with_motif8.bed #224
head -5 motif_1_PSWM_in_peak_161win_previous_with_motif8.bed
#chr1 1883638 1883645 1230.47 6.56e-05 + 1 chr1 1883513 1883674 GATA_ChIP_peak_76
#chr1 20232820 20232827 1230.47 6.56e-05 - 1 chr1 20232679 20232840 GATA_ChIP_peak_603
#chr1 37484682 37484689 1230.47 6.56e-05 - 1 chr1 37484679 37484840 GATA_ChIP_peak_1230
#chr1 89045539 89045546 1230.47 6.56e-05 + 1 chr1 89045405 89045566 GATA_ChIP_peak_2681
#chr1 95084738 95084745 1312.58 2.16e-05 - 1 chr1 95084600 95084761 GATA_ChIP_peak_2867
a BED for the previous motif8 coordinates in 101 window in the browser.
dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/Exhaustive_MEME_MAST_round1to8/
#101 bp window
intersectBed -wa -wb -a ${dir}mast_GATA3_PSWM_in_peaks_round8.bed -b ${dir}without_motifs_123456_7.bed | uniq > old_mast_motif_8_PSWM_in_peak_101win.bed #4454
head -5 old_mast_motif_8_PSWM_in_peak_101win.bed
#chr1 1775856 1775861 1142.66 0.000349 + 1 chr1 1775793 1775894 GATA_ChIP_peak_72 24.5459
#chr1 1855920 1855925 1142.66 0.000349 + 1 chr1 1855841 1855942 GATA_ChIP_peak_74 3.19627
#chr1 1883563 1883568 1142.66 0.000349 - 1 chr1 1883543 1883644 GATA_ChIP_peak_76 36.5894
#chr1 2135607 2135612 1142.66 0.000349 - 1 chr1 2135560 2135661 GATA_ChIP_peak_79 18.7357
#chr1 3613592 3613597 1142.66 0.000349 - 1 chr1 3613537 3613638 GATA_ChIP_peak_90 27.5967
awk 'FNR==NR{peaks[$4]; next} $11 in peaks' peak_161win_with_motif8_with_new_motifs.bed old_mast_motif_8_PSWM_in_peak_101win.bed > old_mast_motif_8_PSWM_in_peak_101win_with_new_motifs.bed #925
head -5 old_mast_motif_8_PSWM_in_peak_101win_with_new_motifs.bed
#chr1 1883563 1883568 1142.66 0.000349 - 1 chr1 1883543 1883644 GATA_ChIP_peak_76 36.5894
#chr1 3831836 3831841 1142.66 0.000349 - 1 chr1 3831759 3831860 GATA_ChIP_peak_96 162.768
#chr1 4597295 4597300 1142.66 0.000349 - 1 chr1 4597256 4597357 GATA_ChIP_peak_100 8.5429
#chr1 8875760 8875765 1142.66 0.000349 + 1 chr1 8875668 8875769 GATA_ChIP_peak_226 8.89252
#chr1 10277867 10277872 1142.66 0.000349 - 1 chr1 10277803 10277904 GATA_ChIP_peak_290 5.68898
Add trackline:
#peak region
peak_101win_with_motif8.bed #4452
peak_161win_with_motif8.bed #4452
peak_161win_with_motif8_with_new_motifs.bed #924
#motifs-old
old_mast_motif_8_PSWM_in_peak_101win_with_new_motifs.bed #925
#motifs-new
motif_1_PSWM_in_peak_161win_previous_with_motif8.bed
motif_2_PSWM_in_peak_161win_previous_with_motif8.bed
motif_3_PSWM_in_peak_161win_previous_with_motif8.bed
motif_4_PSWM_in_peak_161win_previous_with_motif8.bed
motif_5_PSWM_in_peak_161win_previous_with_motif8.bed
motif_7_PSWM_in_peak_161win_previous_with_motif8.bed
motif_8_PSWM_in_peak_161win_previous_with_motif8.bed
awk 'BEGIN {print "browser position chr10:16,000-17,000"
print "track type=bed name=\"peak_101win_with_motif8.bed\" description=\"peak_101win_full\" visibility=full autoScale=on useScore=1 color=0,0,0"
} {print $0}' peak_101win_with_motif8.bed > peak_101win_with_motif8.header.bed
awk 'BEGIN {print "browser position chr10:16,000-17,000"
print "track type=bed name=\"peak_161win_with_motif8.bed\" description=\"peak_161win_full\" visibility=full autoScale=on useScore=1 color=0,0,0"
} {print $0}' peak_161win_with_motif8.bed > peak_161win_with_motif8.header.bed
awk 'BEGIN {print "browser position chr10:16,000-17,000"
print "track type=bed name=\"peak_161win_with_motif8_new.bed\" description=\"peak_161win_new\" visibility=full autoScale=on useScore=1 color=0,0,0"
} {print $0}' peak_161win_with_motif8_with_new_motifs.bed > peak_161win_with_motif8_with_new_motifs.header.bed
awk 'BEGIN {print "browser position chr10:16,000-17,000"
print "track type=bed name=\"old_mast_motif_8_PSWM_in_peak_101win_with_new_motifs.bed\" description=\"motif8_101win_old\" visibility=full autoScale=on useScore=1 color=255,0,0"
} {print $1, $2, $3}' old_mast_motif_8_PSWM_in_peak_101win_with_new_motifs.bed > old_mast_motif_8_PSWM_in_peak_101win_with_new_motifs.header.bed
awk 'BEGIN {print "browser position chr10:16,000-17,000"
print "track type=bed name=\"motif_1_PSWM_in_peak_161win_previous_with_motif8.bed\" description=\"motif1_161win_new\" visibility=full autoScale=on useScore=1 color=0,0,255"
} {print $1, $2, $3}' motif_1_PSWM_in_peak_161win_previous_with_motif8.bed > motif_1_PSWM_in_peak_161win_previous_with_motif8.header.bed
awk 'BEGIN {print "browser position chr10:16,000-17,000"
print "track type=bed name=\"motif_2_PSWM_in_peak_161win_previous_with_motif8.bed\" description=\"motif2_161win_new\" visibility=full autoScale=on useScore=1 color=0,0,255"
} {print $1, $2, $3}' motif_2_PSWM_in_peak_161win_previous_with_motif8.bed > motif_2_PSWM_in_peak_161win_previous_with_motif8.header.bed
awk 'BEGIN {print "browser position chr10:16,000-17,000"
print "track type=bed name=\"motif_3_PSWM_in_peak_161win_previous_with_motif8.bed\" description=\"motif3_161win_new\" visibility=full autoScale=on useScore=1 color=0,0,255"
} {print $1, $2, $3}' motif_3_PSWM_in_peak_161win_previous_with_motif8.bed > motif_3_PSWM_in_peak_161win_previous_with_motif8.header.bed
awk 'BEGIN {print "browser position chr10:16,000-17,000"
print "track type=bed name=\"motif_4_PSWM_in_peak_161win_previous_with_motif8.bed\" description=\"motif4_161win_new\" visibility=full autoScale=on useScore=1 color=0,0,255"
} {print $1, $2, $3}' motif_4_PSWM_in_peak_161win_previous_with_motif8.bed > motif_4_PSWM_in_peak_161win_previous_with_motif8.header.bed
awk 'BEGIN {print "browser position chr10:16,000-17,000"
print "track type=bed name=\"motif_5_PSWM_in_peak_161win_previous_with_motif8.bed\" description=\"motif5_161win_new\" visibility=full autoScale=on useScore=1 color=0,0,255"
} {print $1, $2, $3}' motif_5_PSWM_in_peak_161win_previous_with_motif8.bed > motif_5_PSWM_in_peak_161win_previous_with_motif8.header.bed
awk 'BEGIN {print "browser position chr10:16,000-17,000"
print "track type=bed name=\"motif_7_PSWM_in_peak_161win_previous_with_motif8.bed\" description=\"motif7_161win_new\" visibility=full autoScale=on useScore=1 color=0,0,255"
} {print $1, $2, $3}' motif_7_PSWM_in_peak_161win_previous_with_motif8.bed > motif_7_PSWM_in_peak_161win_previous_with_motif8.header.bed
awk 'BEGIN {print "browser position chr10:16,000-17,000"
print "track type=bed name=\"motif_8_PSWM_in_peak_161win_previous_with_motif8.bed\" description=\"motif8_161win_new\" visibility=full autoScale=on useScore=1 color=0,0,255"
} {print $1, $2, $3}' motif_8_PSWM_in_peak_161win_previous_with_motif8.bed > motif_8_PSWM_in_peak_161win_previous_with_motif8.header.bed
#with_motifs_1_101bp_mast.bed
#with_motifs_1_161bp_mast.bed
awk '{OFS="\t"} {print $1, $2, $3, $4}' with_motifs_1_101bp_mast.bed | sort | uniq | wc -l #11428
awk '{OFS="\t"} {print $1, $2, $3, $4}' with_motifs_1_161bp_mast.bed | sort | uniq | wc -l #1042
#keep the first 4 column (uniq)-the peak region info
awk '{OFS="\t"} {print $1, $2, $3, $4}' with_motifs_1_101bp_mast.bed | sort | uniq >temp_101bp.bed
awk '{OFS="\t"} {print $1, $2, $3, $4}' with_motifs_1_161bp_mast.bed | sort | uniq >temp_161bp.bed
#combine together
cat temp_101bp.bed temp_161bp.bed > GATA3_peak_161win_with_motif_1.bed
wc -l GATA3_peak_161win_with_motif_1.bed
#12470 GATA3_peak_161win_with_motif_1.bed
head -5 GATA3_peak_161win_with_motif_1.bed
#chr10 100072622 100072723 GATA_ChIP_peak_10714
#chr10 100522457 100522558 GATA_ChIP_peak_10733
#chr10 101380152 101380253 GATA_ChIP_peak_10758b
#chr10 101453866 101453967 GATA_ChIP_peak_10760b
#chr10 101892915 101893016 GATA_ChIP_peak_10774
#with_motifs_2_101bp_mast.bed
#with_motifs_2_161bp_mast.bed
awk '{OFS="\t"} {print $1, $2, $3, $4}' with_motifs_2_101bp_mast.bed | sort | uniq | wc -l #10461
awk '{OFS="\t"} {print $1, $2, $3, $4}' with_motifs_2_161bp_mast.bed | sort | uniq | wc -l #1014
#keep the first 4 column (uniq)-the peak region info
awk '{OFS="\t"} {print $1, $2, $3, $4}' with_motifs_2_101bp_mast.bed | sort | uniq >temp_101bp.bed
awk '{OFS="\t"} {print $1, $2, $3, $4}' with_motifs_2_161bp_mast.bed | sort | uniq >temp_161bp.bed
#combine together
cat temp_101bp.bed temp_161bp.bed > GATA3_peak_161win_with_motif_2.bed
wc -l GATA3_peak_161win_with_motif_2.bed
#11475 GATA3_peak_161win_with_motif_2.bed
head -5 GATA3_peak_161win_with_motif_2.bed
#chr10 100301265 100301366 GATA_ChIP_peak_10723
#chr10 100742525 100742626 GATA_ChIP_peak_10737
#chr10 100857059 100857160 GATA_ChIP_peak_10738
#chr10 101033052 101033153 GATA_ChIP_peak_10745
#chr10 101310780 101310881 GATA_ChIP_peak_10754
#with_motifs_4_101bp_mast.bed
#with_motifs_4_161bp_mast.bed
awk '{OFS="\t"} {print $1, $2, $3, $4}' with_motifs_4_101bp_mast.bed | sort | uniq | wc -l #5665
awk '{OFS="\t"} {print $1, $2, $3, $4}' with_motifs_4_161bp_mast.bed | sort | uniq | wc -l #840
#keep the first 4 column (uniq)-the peak region info
awk '{OFS="\t"} {print $1, $2, $3, $4}' with_motifs_4_101bp_mast.bed | sort | uniq >temp_101bp.bed
awk '{OFS="\t"} {print $1, $2, $3, $4}' with_motifs_4_161bp_mast.bed | sort | uniq >temp_161bp.bed
#combine together
cat temp_101bp.bed temp_161bp.bed > GATA3_peak_161win_with_motif_4.bed
wc -l GATA3_peak_161win_with_motif_4.bed
#6505 GATA3_peak_161win_with_motif_4.bed
head -5 GATA3_peak_161win_with_motif_4.bed
#chr10 101454211 101454312 GATA_ChIP_peak_10760c
#chr10 101703087 101703188 GATA_ChIP_peak_10768a
#chr10 101716906 101717007 GATA_ChIP_peak_10770
#chr10 102509685 102509786 GATA_ChIP_peak_10791b
#chr10 1025897 1025998 GATA_ChIP_peak_7503
#with_motifs_5_101bp_mast.bed
#with_motifs_5_161bp_mast.bed
awk '{OFS="\t"} {print $1, $2, $3, $4}' with_motifs_5_101bp_mast.bed | sort | uniq | wc -l #3564
awk '{OFS="\t"} {print $1, $2, $3, $4}' with_motifs_5_161bp_mast.bed | sort | uniq | wc -l #603
#keep the first 4 column (uniq)-the peak region info
awk '{OFS="\t"} {print $1, $2, $3, $4}' with_motifs_5_101bp_mast.bed | sort | uniq >temp_101bp.bed
awk '{OFS="\t"} {print $1, $2, $3, $4}' with_motifs_5_161bp_mast.bed | sort | uniq >temp_161bp.bed
#combine together
cat temp_101bp.bed temp_161bp.bed > GATA3_peak_161win_with_motif_5.bed
wc -l GATA3_peak_161win_with_motif_5.bed
#4167 GATA3_peak_161win_with_motif_5.bed
head -5 GATA3_peak_161win_with_motif_5.bed
#chr10 100139100 100139201 GATA_ChIP_peak_10715
#chr10 100374094 100374195 GATA_ChIP_peak_10727b
#chr10 100413259 100413360 GATA_ChIP_peak_10729
#chr10 101355424 101355525 GATA_ChIP_peak_10756
#chr10 101402156 101402257 GATA_ChIP_peak_10759
#with_motifs_6_101bp_mast.bed
#with_motifs_6_161bp_mast.bed
awk '{OFS="\t"} {print $1, $2, $3, $4}' with_motifs_6_101bp_mast.bed | sort | uniq | wc -l #4510
awk '{OFS="\t"} {print $1, $2, $3, $4}' with_motifs_6_161bp_mast.bed | sort | uniq | wc -l #853
#keep the first 4 column (uniq)-the peak region info
awk '{OFS="\t"} {print $1, $2, $3, $4}' with_motifs_6_101bp_mast.bed | sort | uniq >temp_101bp.bed
awk '{OFS="\t"} {print $1, $2, $3, $4}' with_motifs_6_161bp_mast.bed | sort | uniq >temp_161bp.bed
#combine together
cat temp_101bp.bed temp_161bp.bed > GATA3_peak_161win_with_motif_6.bed
wc -l GATA3_peak_161win_with_motif_6.bed
#5363 GATA3_peak_161win_with_motif_6.bed
head -5 GATA3_peak_161win_with_motif_6.bed
#chr10 100015084 100015185 GATA_ChIP_peak_10709
#chr10 101986847 101986948 GATA_ChIP_peak_10777
#chr10 103188015 103188116 GATA_ChIP_peak_10826b
#chr10 103931753 103931854 GATA_ChIP_peak_10852
#chr10 103979874 103979975 GATA_ChIP_peak_10857b
GATA3_peak_161win_with_motif_1.bed
GATA3_peak_161win_with_motif_2.bed
GATA3_peak_161win_with_motif_4.bed
GATA3_peak_161win_with_motif_5.bed
GATA3_peak_161win_with_motif_6.bed
get all peaks (161bp window) without the 8 MEME/STREME motifs. Parse the peaks to 4 quantile based on peak intensity (use deseq2).
for each quantile, as well as the two negative control (MCF7 DHS rep1, rep2), get the closest GAT to peak summit.
find the second closest GAT.
Make sure that the we want to use the relative distance between the two zinc fingers.
and orientation.
peaks (161bp window) without 8 GATA3-motif (that found by MEME/STREME)
#cd /home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/Exhaustive_MEME_MAST_round1to8
wc -l without_motifs_123456_78_161bp_mast.bed #37308
head -5 without_motifs_123456_78_161bp_mast.bed
#chr1 827300 827461 GATA_ChIP_peak_28 9.30478
#chr1 916689 916850 GATA_ChIP_peak_31 7.79887
#chr1 924773 924934 GATA_ChIP_peak_33 3.78065
#chr1 966573 966734 GATA_ChIP_peak_34 3.78065
#chr1 999428 999589 GATA_ChIP_peak_36 2.11515
In the above file, the last column is the peak intensity reported by MACS3.
Notice that a peak with high MACS3-intensity is not necessarily a intense peak. There are two things we need to consider regarding peak intensities. The first is peak region; the second is dynamic range. Imaging a peak that is very intense but within a narrow range, the other peak is not so intense but can span its signals across a very long distance. We need to consider this region differences while calling any peak to be intense or not.
First use Sam Flag 0x3(reads paired and mapped in proper pair) to calculate the read depth for each GATA library.
# calculate the size factors
module load samtools/1.12
dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/sorted.bam_final/
for i in GATA
do
echo $i
> ${i}_header.txt
> ${i}_reads.txt
for j in ${dir}MCF7_dTAGGATA522*_${i}_*.sorted.bam
do
echo $j
name=$(echo $j | awk -F $dir '{print $2}' | awk -F".sorted.bam" '{print $1}')
echo $name | paste ${i}_header.txt - > ${i}_tmp.txt
mv ${i}_tmp.txt ${i}_header.txt
reads=`samtools view -c -f 0x3 $j` #count the reads paired and mapped in proper pair
echo $reads | paste ${i}_reads.txt - > ${i}_tmp.txt
mv ${i}_tmp.txt ${i}_reads.txt
done
cat ${i}_header.txt ${i}_reads.txt > ${i}_tmp.txt
mv ${i}_tmp.txt ${i}_reads.txt
rm ${i}_header.txt
done
cat GATA_reads.txt
## MCF7_dTAGGATA522_GATA_CC_rep1 MCF7_dTAGGATA522_GATA_CC_rep2 MCF7_dTAGGATA522_GATA_CC_rep3 MCF7_dTAGGATA522_GATA_CE_rep1 MCF7_dTAGGATA522_GATA_CE_rep2 MCF7_dTAGGATA522_GATA_CE_rep3 MCF7_dTAGGATA522_GATA_dE_rep1 MCF7_dTAGGATA522_GATA_dE_rep2 MCF7_dTAGGATA522_GATA_dE_rep3
## 33948616 32585396 34475586 51112588 147834968 136838760 34136142 136271358 85665512
bargraph
library(lattice)
df=as.data.frame(t(read.table("GATA_reads.txt", header=F)))
colnames(df)=c("library", "aligned_reads")
df$libraey=as.factor(df$library)
df$aligned_reads=as.numeric((df$aligned_reads))
barchart(aligned_reads ~ library,
data = df,
ylim=c(0, max(df$aligned_reads)*1.04),
col = "skyblue",
scales = list(x = list(rot = 45)),
xlab = "GATA3 ChIP library",
ylab = "concordantly aligned reads"
)
This bar graph represents the read depth in each GATA3 ChIP-seq library. In the downstream analysis, we will need to use DESeq2 to normalize the counts in each library with size factor to account for this read depth difference.
Get peak intensity within 400bp window withDeseq2 for peaks without MEME/STREME found motifs 12345678.
load peaks without motif 12345678.
#module load R/4.1.2
#R
a = read.table('without_motifs_123456_78_161bp_mast.bed', sep = "\t", header=FALSE)
nrow(a) #37308
## [1] 37308
head(a)
## V1 V2 V3 V4 V5
## 1 chr1 827300 827461 GATA_ChIP_peak_28 9.30478
## 2 chr1 916689 916850 GATA_ChIP_peak_31 7.79887
## 3 chr1 924773 924934 GATA_ChIP_peak_33 3.78065
## 4 chr1 966573 966734 GATA_ChIP_peak_34 3.78065
## 5 chr1 999428 999589 GATA_ChIP_peak_36 2.11515
## 6 chr1 1000456 1000617 GATA_ChIP_peak_37 5.52004
Increase the width to 400bp window.
library(bigWig)
peak.region.400win=center.bed(a, upstreamWindow = 200, downstreamWindow = 200)
nrow(peak.region.400win)
## [1] 37308
head(peak.region.400win)
## V1 V2 V3 V4 V5
## 1 chr1 827180 827581 GATA_ChIP_peak_28 9.30478
## 2 chr1 916569 916970 GATA_ChIP_peak_31 7.79887
## 3 chr1 924653 925054 GATA_ChIP_peak_33 3.78065
## 4 chr1 966453 966854 GATA_ChIP_peak_34 3.78065
## 5 chr1 999308 999709 GATA_ChIP_peak_36 2.11515
## 6 chr1 1000336 1000737 GATA_ChIP_peak_37 5.52004
In the below chunk, we define a function to get raw counts from each sample/reps. This function uses bed.region.bpQuery.bigWig from bigWig package. It requires a bed region file that has the peak region that we want to analysis (right now we want to analyse peak without motifs 123456789); it also requires non-normalized, raw bigWig files (generated directly from seqOutbias) to allow bed.region.bpQuery.bigWig to query the counts information from. The output from this function will be: rows will be the same as bed file, columns will be each bigWig library, entries will be the raw counts.
#functions on github
source('https://raw.githubusercontent.com/mjg54/znf143_pro_seq_analysis/master/docs/ZNF143_functions.R')
#function
get.counts.interval <- function(df, path.to.bigWig, file.prefix = 'H') {
vec.names = c()
inten.df=data.frame(matrix(ncol = 0, nrow = nrow(df)))
for (mod.bigWig in Sys.glob(file.path(path.to.bigWig, paste(file.prefix, "*.bigWig", sep ='')))) {
factor.name = strsplit(strsplit(mod.bigWig, "/")[[1]][length(strsplit(mod.bigWig, "/")[[1]])], '.bigWig')[[1]][1]
print(factor.name)
vec.names = c(vec.names, factor.name)
loaded.bw = load.bigWig(mod.bigWig)
print(mod.bigWig)
mod.inten = bed.region.bpQuery.bigWig(loaded.bw, df, abs.value = TRUE)
inten.df = cbind(inten.df, mod.inten)
}
colnames(inten.df) = vec.names
r.names = paste(df[,1], ':', df[,2], '-', df[,3], sep='')
row.names(inten.df) = r.names
return(inten.df)
}
We will first use the defined function to query raw counts from each non-normalized bigWig files of GATA ChIP-seq use peak region info loaded before (data frame “peak.region.400win”).
library(bigWig)
#non-normalized counts
GATA.counts.df= get.counts.interval(peak.region.400win, "/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/bigWigs/Seqoutbias_bw","MCF") #25 libraries
#nrow(peak.region.400win)
#[1] 37308
#nrow(GATA.counts.df)
#[1] 37308
#head(GATA.counts.df)
#colnames(GATA.counts.df)
GATA.analysis.regions=GATA.counts.df[,grepl("_GATA_",colnames(GATA.counts.df))] # get non-normalized counts from "GATA" libraries
#colnames(GATA.analysis.regions)
#[1] "MCF7_dTAGGATA522_GATA_CC_rep1" "MCF7_dTAGGATA522_GATA_CC_rep2"
#[3] "MCF7_dTAGGATA522_GATA_CC_rep3" "MCF7_dTAGGATA522_GATA_CE_rep1"
#[5] "MCF7_dTAGGATA522_GATA_CE_rep2" "MCF7_dTAGGATA522_GATA_CE_rep3"
#[7] "MCF7_dTAGGATA522_GATA_dE_rep1" "MCF7_dTAGGATA522_GATA_dE_rep2"
#[9] "MCF7_dTAGGATA522_GATA_dE_rep3"
identical(rownames(GATA.analysis.regions),rownames(GATA.counts.df)) # [1] TRUE
Then we use the DESeq2 package to make a counts matrix (DESeqDataSetFromMatrix) and calculate size factors for each library (estimateSizeFactorsForMatrix) use the previously calculated read depth for each library (“GATA_reads.txt”); We use this size factor to normalize the counts (sizeFactors).
Then we use rowMeans to average the normalized counts for the three GATA_CC reps, and save it as “peak.intensities”; this normalized counts now can be use to determine if a peak is intense or not.
library(DESeq2)
sample.conditions = factor(sapply(strsplit(colnames(GATA.analysis.regions), '_rep'), '[', 1))
#[1] MCF7_dTAGGATA522_GATA_CC MCF7_dTAGGATA522_GATA_CC MCF7_dTAGGATA522_GATA_CC
#[4] MCF7_dTAGGATA522_GATA_CE MCF7_dTAGGATA522_GATA_CE MCF7_dTAGGATA522_GATA_CE
#[7] MCF7_dTAGGATA522_GATA_dE MCF7_dTAGGATA522_GATA_dE MCF7_dTAGGATA522_GATA_dE
#3 Levels: MCF7_dTAGGATA522_GATA_CC ... MCF7_dTAGGATA522_GATA_dE
deseq.counts.table = DESeqDataSetFromMatrix(countData = GATA.analysis.regions, # DESeqDataSet needs countData to be non-negative integers; non-normalized counts are integer, normalized signals has decimals.
colData = as.data.frame(sample.conditions),
design = ~ sample.conditions)
GATA.SF <- read.table("GATA_reads.txt", sep = '\t', header = TRUE)[,-1] # GATA size factors from read depth
#MCF7_dTAGGATA522_GATA_CC_rep1 MCF7_dTAGGATA522_GATA_CC_rep2
# 33948616 32585396
# MCF7_dTAGGATA522_GATA_CC_rep3 MCF7_dTAGGATA522_GATA_CE_rep1
# 34475586 51112588
# MCF7_dTAGGATA522_GATA_CE_rep2 MCF7_dTAGGATA522_GATA_CE_rep3
# 147834968 136838760
# MCF7_dTAGGATA522_GATA_dE_rep1 MCF7_dTAGGATA522_GATA_dE_rep2
# 34136142 136271358
# MCF7_dTAGGATA522_GATA_dE_rep3
# 85665512
GATA.size.factors = estimateSizeFactorsForMatrix(GATA.SF) # read depth transformed to size factor
#MCF7_dTAGGATA522_GATA_CC_rep1 MCF7_dTAGGATA522_GATA_CC_rep2
# 0.5385594 0.5169334
#MCF7_dTAGGATA522_GATA_CC_rep3 MCF7_dTAGGATA522_GATA_CE_rep1
# 0.5469193 0.8108480
#MCF7_dTAGGATA522_GATA_CE_rep2 MCF7_dTAGGATA522_GATA_CE_rep3
# 2.3452478 2.1708044
#MCF7_dTAGGATA522_GATA_dE_rep1 MCF7_dTAGGATA522_GATA_dE_rep2
# 0.5415343 2.1618032
#MCF7_dTAGGATA522_GATA_dE_rep3
# 1.3589941
sizeFactors(deseq.counts.table) <- GATA.size.factors # assign to each column of the count matrix (deseq.counts.table) the size factor to bring each column to a common scale
dds <- DESeq(deseq.counts.table)
normalized.counts.GATA3 = counts(dds, normalized=TRUE)
head(normalized.counts.GATA3)
peak.intensities = rowMeans(normalized.counts.GATA3[,1:3]) # we want to get the average read counts for CC groups
names(peak.intensities) = rownames(normalized.counts.GATA3)
save.image('240108_GATA3_ChIP_deseq.Rdata')
Subset peaks without the 8 GATA3-motifs, each with 20% intensity quantile
We can load the saved Rdata and look at each dataframe.
module load R/4.1.2
R
load('240108_GATA3_ChIP_deseq.Rdata')
Read this .bed file into R, and use DeSeq2 to count read size and parse into different quantile
head(peak.intensities)
#chr1:827180-827581 chr1:916569-916970 chr1:924653-925054
# 45.32469 36.69993 30.44684
# chr1:966453-966854 chr1:999308-999709 chr1:1000336-1000737
# 26.87016 21.22455 24.34263
length(peak.intensities)
#[1] 37308
quantile(peak.intensities, probs = seq(.20, 1.00, by = .20))
# 20% 40% 60% 80% 100%
# 31.78709 40.05304 51.95512 80.03307 6270.21535
violin plot
This violin plot is showing the distribution of peak intensity (log10).
library(lattice)
log10quantiles <- quantile(log(abs(peak.intensities), base = 10), probs = seq(.20, 1.00, by = .20))
png('violinplot_GATA3_ChIP_peak_normalized_intensity.png')
print(bwplot(log(abs(peak.intensities), base = 10) ~ factor("1"),
main = "Violin-Like Plot",
panel = function(x, ...) {
panel.violin(x, ...)
panel.abline(h = log10quantiles, col = "red", lty = 2)
},
xlab = "", ylab = "log10 normalized Intensity")
)
dev.off()
Figure 8: normalized GATA3 ChIP peak intensity
This violin plot represents the distribution and probability density of (log10) normalized peak intensity (for peak without motif 1~8). Each red dotted line indicates the quantile cutoff of 20%, which can help us to visualize the peak intensity distribution across different quantile.
Parse peaks into 5 intensity quantile by 20%.
chr = sapply(strsplit(names(peak.intensities), ":"), "[", 1)
rnge = sapply(strsplit(names(peak.intensities), ":"), "[", 2)
start = as.numeric(sapply(strsplit(rnge, "-"), "[", 1)) + 200
end = as.numeric(sapply(strsplit(rnge, "-"), "[", 2)) - 200
quantile(peak.intensities, probs = seq(.20, 1.00, by = .20))
# 1bp summit quantile file
j =0
q=seq(.20, 1.00, by = .20)
count=0
for (i in quantile(peak.intensities, probs = seq(.20, 1.00, by = .20))){
count = count +1
write.table(file = paste0('quantile', as.character(q[count]), '_summits.bed'), data.frame(chr[peak.intensities > j & peak.intensities <= i], start[peak.intensities > j & peak.intensities <= i], end[peak.intensities > j & peak.intensities <= i], peak.intensities[peak.intensities > j & peak.intensities <= i]), sep = '\t', quote=FALSE, col.names=FALSE, row.names=FALSE )
j = i
}
for i in quantile*.bed
do
wc -l $i
done
## 7269 quantile0.2_summits.bed
## 7461 quantile0.4_summits.bed
## 7462 quantile0.6_summits.bed
## 7461 quantile0.8_summits.bed
## 7462 quantile1_summits.bed
See if the number of input samples will change the normalized reads.
I am loading fewer samples (only two condition, GATA_CC and GATA_dE)
GATA.analysis.regions2=GATA.analysis.regions[,c(1,2,3,7,8,9)]
colnames(GATA.analysis.regions2)
#[1] "MCF7_dTAGGATA522_GATA_CC_rep1" "MCF7_dTAGGATA522_GATA_CC_rep2"
#[3] "MCF7_dTAGGATA522_GATA_CC_rep3" "MCF7_dTAGGATA522_GATA_dE_rep1"
#[5] "MCF7_dTAGGATA522_GATA_dE_rep2" "MCF7_dTAGGATA522_GATA_dE_rep3"
library(DESeq2)
sample.conditions2 = factor(sapply(strsplit(colnames(GATA.analysis.regions2), '_rep'), '[', 1))
sample.conditions2
#[1] MCF7_dTAGGATA522_GATA_CC MCF7_dTAGGATA522_GATA_CC MCF7_dTAGGATA522_GATA_CC
#[4] MCF7_dTAGGATA522_GATA_dE MCF7_dTAGGATA522_GATA_dE MCF7_dTAGGATA522_GATA_dE
#Levels: MCF7_dTAGGATA522_GATA_CC MCF7_dTAGGATA522_GATA_dE
deseq.counts.table2 = DESeqDataSetFromMatrix(countData = GATA.analysis.regions2, # DESeqDataSet needs countData to be non-negative integers; non-normalized counts are integer, normalized signals has decimals.
colData = as.data.frame(sample.conditions2),
design = ~ sample.conditions2)
GATA.SF2 <- read.table("GATA_reads.txt", sep = '\t', header = TRUE)[,-1][,c(1,2,3,7,8,9)]
#MCF7_dTAGGATA522_GATA_CC_rep1 MCF7_dTAGGATA522_GATA_CC_rep2
# 33948616 32585396
# MCF7_dTAGGATA522_GATA_CC_rep3 MCF7_dTAGGATA522_GATA_dE_rep1
# 34475586 34136142
# MCF7_dTAGGATA522_GATA_dE_rep2 MCF7_dTAGGATA522_GATA_dE_rep3
# 136271358 85665512
GATA.size.factors2 = estimateSizeFactorsForMatrix(GATA.SF2)
#MCF7_dTAGGATA522_GATA_CC_rep1 MCF7_dTAGGATA522_GATA_CC_rep2
# 0.6821163 0.6547257
#MCF7_dTAGGATA522_GATA_CC_rep3 MCF7_dTAGGATA522_GATA_dE_rep1
# 0.6927045 0.6858842
#MCF7_dTAGGATA522_GATA_dE_rep2 MCF7_dTAGGATA522_GATA_dE_rep3
# 2.7380474 1.7212438
sizeFactors(deseq.counts.table2) <- GATA.size.factors2 # assign to each column of the count matrix (deseq.counts.table) the size factor to bring each column to a common scale
dds <- DESeq(deseq.counts.table2)
normalized.counts.GATA3.2 = counts(dds, normalized=TRUE)
head(normalized.counts.GATA3.2)
peak.intensities2 = rowMeans(normalized.counts.GATA3.2[,1:3])
head(peak.intensities)
# chr1:827180-827581 chr1:916569-916970 chr1:924653-925054
# 45.32469 36.69993 30.44684
# chr1:966453-966854 chr1:999308-999709 chr1:1000336-1000737
# 26.87016 21.22455 24.34263
head(peak.intensities2)
# chr1:827180-827581 chr1:916569-916970 chr1:924653-925054
# 35.78574 28.97613 24.03906
# chr1:966453-966854 chr1:999308-999709 chr1:1000336-1000737
# 21.21512 16.75767 19.21953
save.image('240108_GATA3_ChIP_deseq_test.Rdata')
The alteration in input samples affected the size factor determined by DESeq2’s estimateSizeFactorsForMatrix function, consequently impacting the normalized intensity of reads for each sample. However, I don’t believe it is altering the relative differences within each sample—such as which peak exhibits the highest intensity and which exhibits less.
Coherence check-follow up:
module load R/4.1.2
R
load('240108_GATA3_ChIP_deseq_test.Rdata')
chr = sapply(strsplit(names(peak.intensities2), ":"), "[", 1)
rnge = sapply(strsplit(names(peak.intensities2), ":"), "[", 2)
start = as.numeric(sapply(strsplit(rnge, "-"), "[", 1)) + 200
end = as.numeric(sapply(strsplit(rnge, "-"), "[", 2)) - 200
quantile(peak.intensities2, probs = seq(.20, 1.00, by = .20))
# 1bp summit quantile file
j =0
q=seq(.20, 1.00, by = .20)
count=0
for (i in quantile(peak.intensities2, probs = seq(.20, 1.00, by = .20))){
count = count +1
write.table(file = paste0('test_quantile', as.character(q[count]), '_summits.bed'), data.frame(chr[peak.intensities2 > j & peak.intensities2 <= i], start[peak.intensities2 > j & peak.intensities2 <= i], end[peak.intensities2 > j & peak.intensities2 <= i], peak.intensities2[peak.intensities2 > j & peak.intensities2 <= i]), sep = '\t', quote=FALSE, col.names=FALSE, row.names=FALSE )
j = i
}
See besides the normalized intensity, if peak coordinates assigned to each quantile is the same.
for i in test_quantile*.bed
do
wc -l $i
done
#7269 test_quantile0.2_summits.bed
#7461 test_quantile0.4_summits.bed
#7462 test_quantile0.6_summits.bed
#7461 test_quantile0.8_summits.bed
#7462 test_quantile1_summits.bed
awk '{OFS="\t"} {print $1, $2, $3}' quantile1_summits.bed > test1.bed
awk '{OFS="\t"} {print $1, $2, $3}' test_quantile1_summits.bed > test2.bed
diff test1.bed test2.bed
No difference between the peak regions assigned to each quantile files, despite the difference in the normalized counts.
In this section, we use bedtools closestBed (refer to: https://bedtools.readthedocs.io/en/latest/content/tools/closest.html) to find the closest GAT to each provided peak summit.
Input:
Input2 -a is the sorted peak summit file (centered 1bp);
Input2 -b is the sorted, and concatenated GAT coordinates file (both plus and minus);
GAT coordinates on full hg38 use read size==1000.
dir=/labs/Guertin/siyu/Sathyan_GATA3_ChIP_pool1_pool2/overrep_3mer/hg38_full_kmer3_rs1000/seqdump/
head ${dir}hg38.3.3.3minus.14_GAT.bed
head ${dir}hg38.3.3.3plus.36_GAT.bed
head ${dir}hg38.3.3.3minus.36_ATC.bed
head ${dir}hg38.3.3.3plus.14_ATC.bed
we will concatenate the plus and minus 3mer file together.
cat ${dir}hg38.3.3.3minus.14_GAT.bed ${dir}hg38.3.3.3plus.36_GAT.bed > hg38.3.3.3.30_plus_minus_GAT.bed
cat ${dir}hg38.3.3.3minus.36_ATC.bed ${dir}hg38.3.3.3plus.14_ATC.bed > hg38.3.3.3.30_plus_minus_ATC.bed
wc -l ${dir}hg38.3.3.3minus.14_GAT.bed #38717592
wc -l ${dir}hg38.3.3.3plus.36_GAT.bed #39090736
wc -l hg38.3.3.3.30_plus_minus_GAT.bed #77808328
#38717592+39090736
#[1] 77808328
wc -l ${dir}hg38.3.3.3minus.36_ATC.bed #39086237
wc -l ${dir}hg38.3.3.3plus.14_ATC.bed #38725373
wc -l hg38.3.3.3.30_plus_minus_ATC.bed #77811610
#39086237+38725373
#[1] 77811610
wc -l hg38.3.3.3.30_plus_minus_GAT.bed #77808328
head -5 hg38.3.3.3.30_plus_minus_GAT.bed
#chr1 10545 10548 14 14 - GAT
#chr1 11145 11148 14 14 - GAT
#chr1 11160 11163 14 14 - GAT
#chr1 11576 11579 14 14 - GAT
#chr1 11597 11600 14 14 - GAT
load files contains 3mer coordinates info
#concatenate the plus and minus file together
all.GAT.file=read.table(file = "hg38.3.3.3.30_plus_minus_GAT.bed", sep="\t", header=FALSE)
head(all.GAT.file)
# V1 V2 V3 V4 V5 V6 V7
#1 chr1 10545 10548 14 14 - GAT
#2 chr1 11145 11148 14 14 - GAT
#3 chr1 11160 11163 14 14 - GAT
#4 chr1 11576 11579 14 14 - GAT
#5 chr1 11597 11600 14 14 - GAT
#6 chr1 11846 11849 14 14 - GAT
tail(all.GAT.file)
# V1 V2 V3 V4 V5 V6 V7
#77808323 chrY_KI270740v1_random 36665 36668 36 36 + GAT
#77808324 chrY_KI270740v1_random 36790 36793 36 36 + GAT
#77808325 chrY_KI270740v1_random 36885 36888 36 36 + GAT
#77808326 chrY_KI270740v1_random 36913 36916 36 36 + GAT
#77808327 chrY_KI270740v1_random 37038 37041 36 36 + GAT
#77808328 chrY_KI270740v1_random 37163 37166 36 36 + GAT
nrow(all.GAT.file)
#[1] 77808328
load files contains selected ChIP peak summit info
chip.peak.summit.quantile1=read.table("quantile1_summits.bed", header=FALSE)
nrow(chip.peak.summit.quantile1)
#[1] 7462
head(chip.peak.summit.quantile1)
#V1 V2 V3 V4
#1 chr1 5598187 5598188 271.93122
#2 chr1 8017660 8017661 676.31362
#3 chr1 8020178 8020179 324.95549
#4 chr1 8055212 8055213 111.55874
#5 chr1 8061475 8061476 98.51927
#6 chr1 8120791 8120792 124.49564
chip.peak.summit.quantile0.8=read.table("quantile0.8_summits.bed", header=FALSE)
chip.peak.summit.quantile0.6=read.table("quantile0.6_summits.bed", header=FALSE)
chip.peak.summit.quantile0.4=read.table("quantile0.4_summits.bed", header=FALSE)
chip.peak.summit.quantile0.2=read.table("quantile0.2_summits.bed", header=FALSE)
nrow(chip.peak.summit.quantile0.8)
#[1] 7461
head(chip.peak.summit.quantile0.8)
# V1 V2 V3 V4
#1 chr1 1073822 1073823 56.70255
#2 chr1 5637335 5637336 63.60992
#3 chr1 7019683 7019684 72.40249
#4 chr1 7494212 7494213 75.17251
#5 chr1 8026313 8026314 69.95015
#6 chr1 8070509 8070510 77.97241
summary(chip.peak.summit.quantile0.8$V4)
#Min. 1st Qu. Median Mean 3rd Qu. Max.
# 51.96 56.34 61.88 63.24 69.39 80.03
nrow(chip.peak.summit.quantile0.6)
#[1] 7462
head(chip.peak.summit.quantile0.6)
# V1 V2 V3 V4
#1 chr1 827380 827381 45.32469
#2 chr1 1074184 1074185 48.18883
#3 chr1 1124966 1124967 45.46610
#4 chr1 1157747 1157748 44.19288
#5 chr1 3587455 3587456 51.91936
#6 chr1 3668981 3668982 40.55495
summary(chip.peak.summit.quantile0.6$V4)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 40.05 42.60 45.44 45.63 48.55 51.95
nrow(chip.peak.summit.quantile0.4)
#[1] 7461
head(chip.peak.summit.quantile0.4)
# V1 V2 V3 V4
#1 chr1 916769 916770 36.69993
#2 chr1 1013265 1013266 34.70221
#3 chr1 1013580 1013581 36.26922
#4 chr1 1021142 1021143 32.69951
#5 chr1 1158192 1158193 36.21743
#6 chr1 1208541 1208542 38.65084
summary(chip.peak.summit.quantile0.4$V4)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 31.79 33.73 35.71 35.85 37.98 40.05
nrow(chip.peak.summit.quantile0.2)
#[1] 7269
head(chip.peak.summit.quantile0.2)
# V1 V2 V3 V4
#1 chr1 924853 924854 30.44684
#2 chr1 966653 966654 26.87016
#3 chr1 999508 999509 21.22455
#4 chr1 1000536 1000537 24.34263
#5 chr1 1001891 1001892 25.52872
#6 chr1 1020187 1020188 20.63151
summary(chip.peak.summit.quantile0.2$V4)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.6095 24.2296 27.3925 26.2179 29.8468 31.7863
bedtools closestBed
The below function will sort input bed1 and bed2 first, then run bedtools closestBed between bed1 and bed2.
# define function
bedTools.closest <- function(functionstring="/home/FCAM/ssun/packages/bedtools2/bin/closestBed",bed1,bed2,opt.string="") {
options(scipen =99) # not use scientific notation when writing out
#write bed formatted data.frames to tempfile
write.table(bed1,file= 'a.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(bed2,file= 'b.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
# create the command string and call the command using system()
# the command sort a and b file by coordinates
command1=paste('sort -k1,1 -k2,2n', 'a.file.bed', '> a.file.sorted.bed')
cat(command1,"\n") #sort -k1,1 -k2,2n a.file.bed > a.file.sorted.bed
try(system(command1))
command2=paste('sort -k1,1 -k2,2n', 'b.file.bed', '> b.file.sorted.bed')
cat(command2,"\n")
try(system(command2))
# the command call closestBed on bed1 and bed2
command=paste(functionstring, opt.string,"-a",'a.file.sorted.bed',"-b",'b.file.sorted.bed',">",'out.file.bed',sep=" ")
cat(command,"\n")
try(system(command))
res=read.table('out.file.bed',header=F, comment.char='')
# remove intermediate files
command3=paste('rm', 'a.file.bed', 'b.file.bed', 'a.file.sorted.bed', 'b.file.sorted.bed', 'out.file.bed')
cat(command3,"\n")
try(system(command3))
colnames(res) = c(colnames(bed1), colnames(bed2), 'dis' )
return(res)
}
Parameter -d will report the distance from the closest GAT to the peak summit.
Parameter -t last or -t first will only report either the last or the first entry in bed2 file if tied distance occurred.
Here we choose to use -t first to report the first coordinates when tie occurred.
Find the closest GAT to peak summit regardless of GAT strandedness
closest.1stGAT.to.GATA.peak.quantile1=bedTools.closest(bed1 = chip.peak.summit.quantile1[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
closest.1stGAT.to.GATA.peak.quantile0.8=bedTools.closest(bed1 = chip.peak.summit.quantile0.8[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
closest.1stGAT.to.GATA.peak.quantile0.6=bedTools.closest(bed1 = chip.peak.summit.quantile0.6[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
closest.1stGAT.to.GATA.peak.quantile0.4=bedTools.closest(bed1 = chip.peak.summit.quantile0.4[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
closest.1stGAT.to.GATA.peak.quantile0.2=bedTools.closest(bed1 = chip.peak.summit.quantile0.2[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
write.table(closest.1stGAT.to.GATA.peak.quantile1,file= 'closest.1stGAT.to.GATA.peak.quantile1.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1stGAT.to.GATA.peak.quantile0.8,file= 'closest.1stGAT.to.GATA.peak.quantile0.8.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1stGAT.to.GATA.peak.quantile0.6,file= 'closest.1stGAT.to.GATA.peak.quantile0.6.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1stGAT.to.GATA.peak.quantile0.4,file= 'closest.1stGAT.to.GATA.peak.quantile0.4.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1stGAT.to.GATA.peak.quantile0.2,file= 'closest.1stGAT.to.GATA.peak.quantile0.2.bed', quote=F,sep="\t",col.names=F,row.names=F)
save.image('240108_GATA3_ChIP_clsestbed.Rdata')
load files contains the 4 MCF7 DHS negative controls.
“full DHS control”
MCF7DHS_Rep1_161bp_without_motifs_123456_78.bed #92388
MCF7DHS_Rep2_161bp_without_motifs_123456_78.bed #139899
“independent DHS control”
MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_78.bed #69119
MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_78.bed #113774
# DHS negative ctrl (convert to 1bp summit at center)
library(bigWig)
full.DHS.control.rep1=center.bed(read.table("../ENCODE_DHS_GSE29692/MCF7DHS_Rep1_161bp_without_motifs_123456_78.bed", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
nrow(full.DHS.control.rep1)
#[1] 92388
head(full.DHS.control.rep1)
# V1 V2 V3
#1 chr1 268044 268045
#2 chr1 586135 586136
#3 chr1 629975 629976
#4 chr1 630235 630236
#5 chr1 630555 630556
#6 chr1 631455 631456
full.DHS.control.rep2=center.bed(read.table("../ENCODE_DHS_GSE29692/MCF7DHS_Rep2_161bp_without_motifs_123456_78.bed", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
nrow(full.DHS.control.rep2)
#[1] 139899
head(full.DHS.control.rep2)
# V1 V2 V3
#1 chr1 10255 10256
#2 chr1 16235 16236
#3 chr1 267984 267985
#4 chr1 586135 586136
#5 chr1 630255 630256
#6 chr1 630555 630556
indep.DHS.control.rep1=center.bed(read.table("../ENCODE_DHS_GSE29692/MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_78.bed", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
nrow(indep.DHS.control.rep1)
#[1] 69119
head(indep.DHS.control.rep1)
# V1 V2 V3
#1 chr1 268044 268045
#2 chr1 586135 586136
#3 chr1 629975 629976
#4 chr1 630235 630236
#5 chr1 630555 630556
#6 chr1 631455 631456
indep.DHS.control.rep2=center.bed(read.table("../ENCODE_DHS_GSE29692/MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_78.bed", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
nrow(indep.DHS.control.rep2)
#[1] 113774
head(indep.DHS.control.rep2)
# V1 V2 V3
#1 chr1 10255 10256
#2 chr1 16235 16236
#3 chr1 267984 267985
#4 chr1 586135 586136
#5 chr1 630255 630256
#6 chr1 630555 630556
calculate the distance between the closest GAT to DHS regions’ summits.
closest.1stGAT.to.full.DHS.control.rep1=bedTools.closest(bed1 = full.DHS.control.rep1[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
closest.1stGAT.to.full.DHS.control.rep2=bedTools.closest(bed1 = full.DHS.control.rep2[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
closest.1stGAT.to.indep.DHS.control.rep1=bedTools.closest(bed1 = indep.DHS.control.rep1[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
closest.1stGAT.to.indep.DHS.control.rep2=bedTools.closest(bed1 = indep.DHS.control.rep2[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
write.table(closest.1stGAT.to.full.DHS.control.rep1,file= 'closest.1stGAT.to.full.DHS.control.rep1.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1stGAT.to.full.DHS.control.rep2,file= 'closest.1stGAT.to.full.DHS.control.rep2.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1stGAT.to.indep.DHS.control.rep1,file= 'closest.1stGAT.to.indep.DHS.control.rep1.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1stGAT.to.indep.DHS.control.rep2,file= 'closest.1stGAT.to.indep.DHS.control.rep2.bed', quote=F,sep="\t",col.names=F,row.names=F)
save.image('240108_GATA3_ChIP_clsestbed2.Rdata')
negative control comparison
First, we want to ask if the MCF7 DHS controls are comparable to each other.
Combine info to the same data frame: negative control (MCF7 DHS regions) info + distance +status.
df.neg = data.frame(matrix(nrow = 0, ncol = 5))
colnames(df.neg) = c("V1","V2","v3", "dis", "status")
for (chip.peak in Sys.glob(file.path("./*DHS.control.rep*.bed"))) {
print(chip.peak)
rep.name =strsplit((strsplit(strsplit(chip.peak, "/")[[1]][length(strsplit(chip.peak, "/")[[1]])], 'closest.1stGAT.to.')[[1]][2]), ".bed")[[1]][1]
print(rep.name)
all.distance = cbind(read.table(chip.peak,header=F, comment.char='')[,c(1:3, 11)], rep.name)
df.neg = rbind(df.neg, all.distance)
}
## [1] "./closest.1stGAT.to.full.DHS.control.rep1.bed"
## [1] "full.DHS.control.rep1"
## [1] "./closest.1stGAT.to.full.DHS.control.rep2.bed"
## [1] "full.DHS.control.rep2"
## [1] "./closest.1stGAT.to.indep.DHS.control.rep1.bed"
## [1] "indep.DHS.control.rep1"
## [1] "./closest.1stGAT.to.indep.DHS.control.rep2.bed"
## [1] "indep.DHS.control.rep2"
str(df.neg)
## 'data.frame': 415180 obs. of 5 variables:
## $ V1 : chr "chr1" "chr1" "chr1" "chr1" ...
## $ V2 : int 268044 586135 629975 630235 630555 631455 631735 632315 633015 634475 ...
## $ V3 : int 268045 586136 629976 630236 630556 631456 631736 632316 633016 634476 ...
## $ V11 : int 2 8 38 31 4 39 44 4 11 0 ...
## $ rep.name: chr "full.DHS.control.rep1" "full.DHS.control.rep1" "full.DHS.control.rep1" "full.DHS.control.rep1" ...
colnames(df.neg) = c("V1","V2","V3", "dis", "status")
unique(df.neg$status)
## [1] "full.DHS.control.rep1" "full.DHS.control.rep2" "indep.DHS.control.rep1"
## [4] "indep.DHS.control.rep2"
head(df.neg)
## V1 V2 V3 dis status
## 1 chr1 268044 268045 2 full.DHS.control.rep1
## 2 chr1 586135 586136 8 full.DHS.control.rep1
## 3 chr1 629975 629976 38 full.DHS.control.rep1
## 4 chr1 630235 630236 31 full.DHS.control.rep1
## 5 chr1 630555 630556 4 full.DHS.control.rep1
## 6 chr1 631455 631456 39 full.DHS.control.rep1
CDF plot
library(lattice)
library(latticeExtra)
df.neg$status = factor(df.neg$status, levels = c("full.DHS.control.rep1", "full.DHS.control.rep2", "indep.DHS.control.rep1", "indep.DHS.control.rep2"))
ecdfplot(~log(abs(dis), base = 10), groups = status, data = df.neg,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c(colorRampPalette(c("blue4","blue"))(2), colorRampPalette(c("grey60","grey30"))(2)),
aspect = 1,
#xlim = c(0, 50000),
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = expression('log'[10]~'3mer-GAT Distance from peak summit'),
#index.cond = list(c(2,1)),
between=list(y=1.0),
type = 'a',
xlim = c(0,2.5),
lwd=2,
lty=c(1),
par.settings = list(superpose.line = list(col=c(colorRampPalette(c("blue4","blue"))(2), colorRampPalette(c("grey60","grey30"))(2)), lwd=3), strip.background=list(col="grey85"))
)
The blue traces are the two reps from full MCF7 DHS regions, the grey traces are the two reps from MCF7 HDS regions without overlapped GATA3 peak regions.
Four negative traces looks very similar to each other. By eye, two traces from rep1 are slightly leftward shifted than two traces from rep2.
Kolmogorov–Smirnov test
KS test is used to compare distributions. Notice that it is a pretty sensitive test.
The ks.test function in R returns a test statistic and a p-value. We can interpret the p-value to determine whether to reject the null hypothesis (H0: two sample draw from populations with same distribution). If the p-value is less than than significance level (e.g., 0.05), we can reject the null hypothesis and conclude that the distributions are significantly different.
data1=abs(df.neg$dis)[df.neg$status == 'indep.DHS.control.rep1']
data2=abs(df.neg$dis)[df.neg$status == 'indep.DHS.control.rep2']
data3=abs(df.neg$dis)[df.neg$status == 'full.DHS.control.rep1']
data4=abs(df.neg$dis)[df.neg$status == 'full.DHS.control.rep2']
#rep1 vs. rep2
ks.test(data1 , data2)
## Warning in ks.test.default(data1, data2): p-value will be approximate in the
## presence of ties
##
## Asymptotic two-sample Kolmogorov-Smirnov test
##
## data: data1 and data2
## D = 0.026744, p-value < 2.2e-16
## alternative hypothesis: two-sided
ks.test(data3 , data4)
## Warning in ks.test.default(data3, data4): p-value will be approximate in the
## presence of ties
##
## Asymptotic two-sample Kolmogorov-Smirnov test
##
## data: data3 and data4
## D = 0.029757, p-value < 2.2e-16
## alternative hypothesis: two-sided
# full vs. indep (within each rep)
ks.test(data1 , data3)
## Warning in ks.test.default(data1, data3): p-value will be approximate in the
## presence of ties
##
## Asymptotic two-sample Kolmogorov-Smirnov test
##
## data: data1 and data3
## D = 0.0056755, p-value = 0.1565
## alternative hypothesis: two-sided
ks.test(data2 , data4)
## Warning in ks.test.default(data2, data4): p-value will be approximate in the
## presence of ties
##
## Asymptotic two-sample Kolmogorov-Smirnov test
##
## data: data2 and data4
## D = 0.0026765, p-value = 0.7597
## alternative hypothesis: two-sided
# default of ks.test() uses two-tail t test, which is what we wanted, because we do not restrict the test on whether one of the two dataset is larger/smaller than the other. If we had the expectation, we should use one-tail.
Looking at KS test p-value from different comparisons, we can tell that:
For the comparisons between rep1 and rep2, the p-value is very small (<0.001), this indicates that we can reject the NULL hypothesis (H0) and conclude that the data (distance between closest GAT to region summits) distribution are significantly different.
For the comparisons between the independent and full within same replicate, the KS test P-value is large which indicates that they are very likely to have same data distribution. Since the independent DHS control is the one we removed all overlapped GATA3 ChIP peak regions from the full DHS, the majority of data should be overlapped to each other, so we would expect to see this similar data distribution.
We also expect that the independent control compared to its complete version in the same rep will be more right-shifted. (in other word, has less GAT 3mer close to its summit). let’s check:
indep vs. full in rep1
library(lattice)
library(latticeExtra)
df.neg.rep1=rbind(df.neg[df.neg$status=="full.DHS.control.rep1",], df.neg[df.neg$status=="indep.DHS.control.rep1",])
df.neg.rep1$status = factor(df.neg.rep1$status, levels = c("full.DHS.control.rep1", "indep.DHS.control.rep1"))
ecdfplot(~log(abs(dis), base = 10), groups = status, data = df.neg.rep1,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c("blue4","grey60"),
aspect = 1,
#xlim = c(0, 50000),
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = expression('log'[10]~'3mer-GAT Distance from peak summit'),
#index.cond = list(c(2,1)),
between=list(y=1.0),
type = 'a',
xlim = c(0, 2.5),
lwd=2,
lty=c(1),
par.settings = list(superpose.line = list(col=c("blue4","grey60"), lwd=3), strip.background=list(col="grey85"))
)
The two traces are mostly overlapped. By eye, blue trace (full) is slightly leftward shifted than the grey trace (indep).
Further validate if these traces converge/parallel to each other.
# Specify categories to compare within df.neg.rep1
match = ecdf(abs(df.neg.rep1$dis)[df.neg.rep1$status == 'indep.DHS.control.rep1'])
rep = ecdf(abs(df.neg.rep1$dis)[df.neg.rep1$status == 'full.DHS.control.rep1'])
match.y = seq(0, 1600, by=1) # creating indices
rep.y = seq(0, 1600, by=1)
spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)
print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1])
## [1] 38
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,60),
cex=0.7,
xlab = 'Distance indices (abs) between closest 3mer-GAT to peak summit (bp)',
ylab = 'full.DHS.control.rep1 CDF - indep.DHS.control.rep1 CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,300),
cex=0.7,
xlab = 'Distance indices (abs) between closest 3mer-GAT to peak summit (bp)',
ylab = 'full.DHS.control.rep1 CDF - indep.DHS.control.rep1 CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
indep vs. full in rep2
library(lattice)
library(latticeExtra)
df.neg.rep2=rbind(df.neg[df.neg$status=="full.DHS.control.rep2",], df.neg[df.neg$status=="indep.DHS.control.rep2",])
df.neg.rep2$status = factor(df.neg.rep2$status, levels = c("full.DHS.control.rep2", "indep.DHS.control.rep2"))
ecdfplot(~log(abs(dis), base = 10), groups = status, data = df.neg.rep2,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c("blue","grey30"),
aspect = 1,
#xlim = c(0, 50000),
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = expression('log'[10]~'3mer-GAT Distance from peak summit'),
#index.cond = list(c(2,1)),
between=list(y=1.0),
type = 'a',
xlim = c(0, 2.5),
lwd=2,
lty=c(1),
par.settings = list(superpose.line = list(col=c("blue","grey30"), lwd=3), strip.background=list(col="grey85"))
)
The two traces are mostly overlapped.
Further validate if these traces converge/parallel to each other.
# Specify categories to compare within df.neg.rep2
match = ecdf(abs(df.neg.rep2$dis)[df.neg.rep2$status == 'indep.DHS.control.rep2'])
rep = ecdf(abs(df.neg.rep2$dis)[df.neg.rep2$status == 'full.DHS.control.rep2'])
match.y = seq(0, 1600, by=1) # creating indices
rep.y = seq(0, 1600, by=1)
spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)
print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1])
## [1] 9
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,60),
cex=0.7,
xlab ='Distance indices (abs) between closest 3mer-GAT to peak summit (bp)',
ylab = 'full.DHS.control.rep2 CDF - indep.DHS.control.rep2) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,300),
cex=0.7,
xlab = 'Distance indices (abs) between closest 3mer-GAT to peak summit (bp)',
ylab = 'full.DHS.control.rep2 CDF - indep.DHS.control.rep2) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
Comparing the full DHS regions between rep1 and rep2.
library(lattice)
library(latticeExtra)
df.neg.full=rbind(df.neg[df.neg$status=="full.DHS.control.rep1",], df.neg[df.neg$status=="full.DHS.control.rep2",])
df.neg.full$status = factor(df.neg.full$status, levels = c("full.DHS.control.rep1", "full.DHS.control.rep2"))
ecdfplot(~log(abs(dis), base = 10), groups = status, data = df.neg.full,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c("blue4","blue"),
aspect = 1,
#xlim = c(0, 50000),
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = expression('log'[10]~'3mer-GAT Distance from peak summit'),
#index.cond = list(c(2,1)),
between=list(y=1.0),
type = 'a',
xlim = c(0, 2.5),
lwd=2,
lty=c(1),
par.settings = list(superpose.line = list(col=c("blue4","blue"), lwd=3), strip.background=list(col="grey85"))
)
# Specify categories
match = ecdf(abs(df.neg$dis)[df.neg$status == 'full.DHS.control.rep2'])
rep = ecdf(abs(df.neg$dis)[df.neg$status == 'full.DHS.control.rep1'])
match.y = seq(0, 2300, by=1) # creating indices
rep.y = seq(0, 2300, by=1)
spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)
print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1])
## [1] 32
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,60),
cex=0.7,
xlab = 'Distance indices (abs) between closest 3mer-GAT to peak summit (bp)',
ylab = 'full.DHS.control.rep1 CDF - full.DHS.control.rep2 CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,300),
cex=0.7,
xlab = 'Distance indices (abs) between closest 3mer-GAT to peak summit (bp)',
ylab = 'full.DHS.control.rep1 CDF - full.DHS.control.rep2 CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
In summary, within each replicate, the independent controls predominantly overlap with the full control. Notably, the full control exhibits a slight leftward shift compared to the independent control. This aligns with expectations, considering that the independent control omits the overlapped GATA3 peak regions, resulting in reduced bias toward GAT 3mer in proximity to the summit. Additionally, the fluctuations observed in the differences between the independent trace CDF and the full trace CDF imply minimal discrepancies.
On the other hand, when comparing the negative controls between replicate 2 and replicate 1, it’s discernible that GAT is closer to the summit of replicate 1 than that of replicate 2.
The above ks tests as well as cdf plots reveal that data (closest GAT distance to DHS regions summit) distribution is significantly different between rep1 and rep2 DHS. Thus, we want to take the intersect regions from the two replicates, then use the new sets’ summit to calculate distance to closest GAT as the consensus negative controls.
DHS peak intensity append from original data
head -5 hg38.rep1.intensity.bed
#chr1 99534644 99534794 11
#chr1 9942462 9942612 14
#chr1 99560724 99560874 13
#chr1 9942982 9943132 42
#chr1 9943342 9943492 25
head -5 hg38.rep2.intensity.bed
#chr1 99544544 99544694 17
#chr1 99559364 99559514 14
#chr1 9942462 9942612 36
#chr1 9943022 9943172 158
#chr1 9943322 9943472 225
Assign the intensity to full neg control rep1 and rep2:
intersect with peak that extend to 161bp and removed motifs123456_78
module load bedtools
intersectBed -wa -wb -a MCF7DHS_Rep1_161bp_without_motifs_123456_78.bed -b hg38.rep1.intensity.bed | head -5
#chr1 267964 268125 chr1 267969 268119 11
#chr1 586055 586216 chr1 586060 586210 16
#chr1 629895 630056 chr1 629900 630050 41
#chr1 630155 630316 chr1 630160 630310 54
#chr1 630475 630636 chr1 630480 630630 95
intersectBed -wa -wb -a MCF7DHS_Rep1_161bp_without_motifs_123456_78.bed -b hg38.rep1.intensity.bed | awk '{print $1, $2, $3, $7}' | uniq | wc -l #92398
intersectBed -wa -wb -a MCF7DHS_Rep1_161bp_without_motifs_123456_78.bed -b hg38.rep1.intensity.bed | awk '{print $1, $2, $3}' | uniq | wc -l #92388
#There are 10 peaks with multiple regions assigned (and with more than one intensity value assigned)
intersectBed -wa -wb -a MCF7DHS_Rep1_161bp_without_motifs_123456_78.bed -b hg38.rep1.intensity.bed | awk '{print $1, $2, $3}' | uniq -d
chr1 143972158 143972319
chr21 8811634 8811795
chr1 143972221 143972382
chr1 120851115 120851276
chr1 120850576 120850737
chr1 120850664 120850825
chr1 120851163 120851324
chr21 8811482 8811643
chr9 40992171 40992332
chr9 40992166 40992327
intersectBed -wa -wb -a MCF7DHS_Rep1_161bp_without_motifs_123456_78.bed -b hg38.rep1.intensity.bed | grep 143972158
chr1 143972158 143972319 chr1 143972163 143972313 252
chr1 143972158 143972319 chr1 143972226 143972376 915
# the original DHS file, after liftover, it has some regions partially overlapped and has different intensity.
intersectBed -wa -wb -a MCF7DHS_Rep2_161bp_without_motifs_123456_78.bed -b hg38.rep2.intensity.bed | head -5
#chr1 10175 10336 chr1 10180 10330 12
#chr1 16155 16316 chr1 16160 16310 11
#chr1 267904 268065 chr1 267909 268059 41
#chr1 586055 586216 chr1 586060 586210 66
#chr1 630175 630336 chr1 630180 630330 91
intersectBed -wa -wb -a MCF7DHS_Rep2_161bp_without_motifs_123456_78.bed -b hg38.rep2.intensity.bed | awk '{print $1, $2, $3, $7}' | uniq | wc -l #139911
intersectBed -wa -wb -a MCF7DHS_Rep2_161bp_without_motifs_123456_78.bed -b hg38.rep2.intensity.bed | awk '{print $1, $2, $3}' | uniq | wc -l #139899
intersectBed -wa -wb -a MCF7DHS_Rep2_161bp_without_motifs_123456_78.bed -b hg38.rep2.intensity.bed | awk '{print $1, $2, $3}' | uniq -d
chr1 143972158 143972319
chr21 8811634 8811795
chr1 143972221 143972382
chr1 120851075 120851236
chr1 120850744 120850905
chr1 120851183 120851344
chr14 105999444 105999605
chr2 87396672 87396833
chr2 87396751 87396912
chr21 8811482 8811643
chr9 40992171 40992332
chr9 40992166 40992327
# the same 10 peak has multiple region/intensity assigned (but intensity are different)
intersectBed -wa -wb -a MCF7DHS_Rep2_161bp_without_motifs_123456_78.bed -b hg38.rep2.intensity.bed | grep 143972158
chr1 143972158 143972319 chr1 143972163 143972313 377
chr1 143972158 143972319 chr1 143972226 143972376 1352
# rep1 with intensity
intersectBed -wa -wb -a MCF7DHS_Rep1_161bp_without_motifs_123456_78.bed -b hg38.rep1.intensity.bed | awk '{print $1, $2, $3, $7}' OFS="\t"| uniq > MCF7DHS_Rep1_161bp_without_motifs_123456_78_with_intensity.bed
wc -l MCF7DHS_Rep1_161bp_without_motifs_123456_78_with_intensity.bed
#92398 MCF7DHS_Rep1_161bp_without_motifs_123456_78_with_intensity.bed
head -5 MCF7DHS_Rep1_161bp_without_motifs_123456_78_with_intensity.bed
#chr1 267964 268125 11
#chr1 586055 586216 16
#chr1 629895 630056 41
#chr1 630155 630316 54
#chr1 630475 630636 95
# rep2 with intensity
intersectBed -wa -wb -a MCF7DHS_Rep2_161bp_without_motifs_123456_78.bed -b hg38.rep2.intensity.bed | awk '{print $1, $2, $3, $7}' OFS="\t" | uniq > MCF7DHS_Rep2_161bp_without_motifs_123456_78_with_intensity.bed
wc -l MCF7DHS_Rep2_161bp_without_motifs_123456_78_with_intensity.bed
#139911 MCF7DHS_Rep2_161bp_without_motifs_123456_78_with_intensity.bed
head -5 MCF7DHS_Rep2_161bp_without_motifs_123456_78_with_intensity.bed
#chr1 10175 10336 12
#chr1 16155 16316 11
#chr1 267904 268065 41
#chr1 586055 586216 66
#chr1 630175 630336 91
Venn plot
# rep1 peak (overlapped with rep2)
module load bedtools
intersectBed -wa -a MCF7DHS_Rep1_161bp_without_motifs_123456_78_with_intensity.bed -b MCF7DHS_Rep2_161bp_without_motifs_123456_78_with_intensity.bed | awk '{print $1, $2, $3}' OFS="\t"| uniq | wc -l #78703
# rep1 peak (non-overlapped with rep2)
intersectBed -v -wa -a MCF7DHS_Rep1_161bp_without_motifs_123456_78_with_intensity.bed -b MCF7DHS_Rep2_161bp_without_motifs_123456_78_with_intensity.bed | awk '{print $1, $2, $3}' OFS="\t"| uniq | wc -l #13685
#13685+78703
#[1] 92388
139899-78703=61196
# rep2 peak (overlapped with rep1)
intersectBed -wa -a MCF7DHS_Rep2_161bp_without_motifs_123456_78_with_intensity.bed -b MCF7DHS_Rep1_161bp_without_motifs_123456_78_with_intensity.bed | awk '{print $1, $2, $3}' OFS="\t"| uniq | wc -l #78779
# rep2 peak (non-overlapped with rep1)
intersectBed -v -wa -a MCF7DHS_Rep2_161bp_without_motifs_123456_78_with_intensity.bed -b MCF7DHS_Rep1_161bp_without_motifs_123456_78_with_intensity.bed | awk '{print $1, $2, $3}' OFS="\t"| uniq | wc -l #61120
#78779+61120
#[1] 139899
92388-78779=13609
Figure 9: venn plot for the two DHS replicates
The above venn plot is showing the intersected regions while intersecting rep1 vs. rep2 (count the original entry in rep1 for each overlap); the below venn plot is showing the intersected regions while intersecting rep2 vs. rep1 (count the original entry in rep2 for each overlap).
bw plot
# rep1 peak (overlapped with rep2)
module load bedtools
intersectBed -wa -a MCF7DHS_Rep1_161bp_without_motifs_123456_78_with_intensity.bed -b MCF7DHS_Rep2_161bp_without_motifs_123456_78_with_intensity.bed > overlap_rep1.bed
# rep1 peak (non-overlapped with rep2)
intersectBed -v -wa -a MCF7DHS_Rep1_161bp_without_motifs_123456_78_with_intensity.bed -b MCF7DHS_Rep2_161bp_without_motifs_123456_78_with_intensity.bed > non_overlap_rep1.bed
# rep2 peak (overlapped with rep1)
intersectBed -wa -a MCF7DHS_Rep2_161bp_without_motifs_123456_78_with_intensity.bed -b MCF7DHS_Rep1_161bp_without_motifs_123456_78_with_intensity.bed > overlap_rep2.bed
# rep2 peak (non-overlapped with rep1)
intersectBed -v -wa -a MCF7DHS_Rep2_161bp_without_motifs_123456_78_with_intensity.bed -b MCF7DHS_Rep1_161bp_without_motifs_123456_78_with_intensity.bed > non_overlap_rep2.bed
library(lattice)
library(latticeExtra)
bw.df = data.frame(matrix(nrow = 0, ncol = 5))
for (DHS.peak in Sys.glob(file.path("./*overlap_rep*.bed"))) {
print(DHS.peak)
status=strsplit(strsplit(DHS.peak, "/")[[1]][2], ".bed")[[1]]
print(status)
temp = cbind(read.table(DHS.peak,header=F, comment.char=''), status)
bw.df = rbind(bw.df, temp)
}
## [1] "./non_overlap_rep1.bed"
## [1] "non_overlap_rep1"
## [1] "./non_overlap_rep2.bed"
## [1] "non_overlap_rep2"
## [1] "./overlap_rep1.bed"
## [1] "overlap_rep1"
## [1] "./overlap_rep2.bed"
## [1] "overlap_rep2"
colnames(bw.df) = c("V1","V2","v3", "intensity", "status")
bw.df$status = factor(bw.df$status, levels = c("non_overlap_rep1", "overlap_rep1", "overlap_rep2", "non_overlap_rep2"))
str(bw.df)
## 'data.frame': 232630 obs. of 5 variables:
## $ V1 : chr "chr1" "chr1" "chr1" "chr1" ...
## $ V2 : int 629895 913795 938535 999555 1070415 1122975 1360775 1471095 2397136 2843850 ...
## $ v3 : int 630056 913956 938696 999716 1070576 1123136 1360936 1471256 2397297 2844011 ...
## $ intensity: int 41 12 17 10 20 12 10 11 16 9 ...
## $ status : Factor w/ 4 levels "non_overlap_rep1",..: 1 1 1 1 1 1 1 1 1 1 ...
unique(bw.df$status)
## [1] non_overlap_rep1 non_overlap_rep2 overlap_rep1 overlap_rep2
## Levels: non_overlap_rep1 overlap_rep1 overlap_rep2 non_overlap_rep2
bw.df$rep <- apply(bw.df, 1, function(row) sub(".*_(rep\\d+)", "\\1", row["status"]))
unique(bw.df$rep)
## [1] "rep1" "rep2"
bw.df$supp <- apply(bw.df, 1, function(row) sub("^(.*)_rep\\d+$", "\\1", row["status"]))
# Create a bw plot using lattice
bwplot(intensity ~ supp | rep,
data = bw.df,
ylim=c(-50,400),
do.out=FALSE,
main = "DHS regions (complete) intensity")
If we only take intersection of rep1 and rep2 DHS regions, we know that the majority in the intersection has higher intensity compared to the disgarded regions (non-overlapped).
closestBed
#cd /home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/GAT_3mer_analaysis/ENCODE_DHS_GSE29692
# "full DHS control"
head -5 MCF7DHS_Rep1_161bp_without_motifs_123456_78.bed #92388
#chr1 267964 268125
#chr1 586055 586216
#chr1 629895 630056
#chr1 630155 630316
#chr1 630475 630636
head -5 MCF7DHS_Rep2_161bp_without_motifs_123456_78.bed #139899
#chr1 10175 10336
#chr1 16155 16316
#chr1 267904 268065
#chr1 586055 586216
#chr1 630175 630336
# "independent DHS control"
head -5 MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_78.bed #69119
#chr1 267964 268125
#chr1 586055 586216
#chr1 629895 630056
#chr1 630155 630316
#chr1 630475 630636
head -5 MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_78.bed #113774
#chr1 10175 10336
#chr1 16155 16316
#chr1 267904 268065
#chr1 586055 586216
#chr1 630175 630336
# consensus DHS control-full
module load bedtools
intersectBed -a MCF7DHS_Rep1_161bp_without_motifs_123456_78.bed -b MCF7DHS_Rep2_161bp_without_motifs_123456_78.bed > MCF7DHS_consensus_without_motifs_123456_78.bed
wc -l MCF7DHS_consensus_without_motifs_123456_78.bed #78858
head -5 MCF7DHS_consensus_without_motifs_123456_78.bed
#chr1 267964 268065
#chr1 586055 586216
#chr1 630175 630316
#chr1 630475 630636
#chr1 631375 631536
# consensus DHS control-indep
intersectBed -a MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_78.bed -b MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_78.bed > MCF7DHS_consensus_noGATA_without_motifs_123456_78.bed
wc -l MCF7DHS_consensus_noGATA_without_motifs_123456_78.bed #57906
head -5 MCF7DHS_consensus_noGATA_without_motifs_123456_78.bed
#chr1 267964 268065
#chr1 586055 586216
#chr1 630175 630316
#chr1 630475 630636
#chr1 631375 631536
full.DHS.control.consensus=center.bed(read.table("../ENCODE_DHS_GSE29692/MCF7DHS_consensus_without_motifs_123456_78.bed", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
nrow(full.DHS.control.consensus)
#[1] 78858
head(full.DHS.control.consensus)
# V1 V2 V3
#1 chr1 268014 268015
#2 chr1 586135 586136
#3 chr1 630245 630246
#4 chr1 630555 630556
#5 chr1 631455 631456
#6 chr1 631745 631746
indep.DHS.control.consensus=center.bed(read.table("../ENCODE_DHS_GSE29692/MCF7DHS_consensus_noGATA_without_motifs_123456_78.bed", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
nrow(indep.DHS.control.consensus)
#[1] 57906
head(indep.DHS.control.consensus)
# V1 V2 V3
#1 chr1 268014 268015
#2 chr1 586135 586136
#3 chr1 630245 630246
#4 chr1 630555 630556
#5 chr1 631455 631456
#6 chr1 631745 631746
closest.1stGAT.to.full.DHS.control.consensus=bedTools.closest(bed1 = full.DHS.control.consensus[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
closest.1stGAT.to.indep.DHS.control.consensus=bedTools.closest(bed1 = indep.DHS.control.consensus[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
write.table(closest.1stGAT.to.full.DHS.control.consensus,file= 'closest.1stGAT.to.full.DHS.control.consensus.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1stGAT.to.indep.DHS.control.consensus,file= 'closest.1stGAT.to.indep.DHS.control.consensus.bed', quote=F,sep="\t",col.names=F,row.names=F)
save.image('240118_GATA3_ChIP_clsestbed4.Rdata')
An alternative positive control could be closest GAT distance to all peaks that contains GATA3-like motifs found by MEME/STREME. Load files contains all peak with GATA3-like motifs (1 to 8):
cd /home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/Exhaustive_MEME_MAST_round1to8/
ls with_motifs_*mast.bed
head -5 with_motifs_1_101bp_mast.bed
#chr1 1080506 1080607 GATA_ChIP_peak_45 6.37157 chr1 1080566 1080576 840.25 9.55e-05 + 1
#chr1 1238187 1238288 GATA_ChIP_peak_51 16.8938 chr1 1238219 1238229 1413 9.62e-06 - 1
#chr1 1686381 1686482 GATA_ChIP_peak_63 35.0463 chr1 1686405 1686415 1176 3.73e-05 - 1
#chr1 1741080 1741181 GATA_ChIP_peak_68 11.6965 chr1 1741170 1741180 1176 3.73e-05 - 1
#chr1 2001971 2002072 GATA_ChIP_peak_77 7.79887 chr1 2002016 2002026 1176 3.73e-05 + 1
with_motifs_1_101bp_mast.bed with_motifs_5_101bp_mast.bed
with_motifs_1_161bp_mast.bed with_motifs_5_161bp_mast.bed
with_motifs_2_101bp_mast.bed with_motifs_6_101bp_mast.bed
with_motifs_2_161bp_mast.bed with_motifs_6_161bp_mast.bed
with_motifs_3_101bp_mast.bed with_motifs_7_101bp_mast.bed
with_motifs_3_161bp_mast.bed with_motifs_7_161bp_mast.bed
with_motifs_4_101bp_mast.bed with_motifs_8_101bp_mast.bed
with_motifs_4_161bp_mast.bed
# positive control-all
library(bigWig)
pos.control.peak.with.all=data.frame()
for (mast.bed in Sys.glob(file.path("/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/Exhaustive_MEME_MAST_round1to8/", paste("with_motifs_", "*mast.bed", sep ='')))) {
temp=read.table(mast.bed, header=FALSE)[, 1:5] #only take peak regions
uniq.peak=temp[!duplicated(temp$V4), ] # make sure that the peak regions are unique
uniq.peak.center=center.bed(uniq.peak, upstreamWindow = 0, downstreamWindow = 0) #take the summit
pos.control.peak.with.all=rbind(pos.control.peak.with.all, uniq.peak.center)
}
nrow(pos.control.peak.with.all)
#[1] 59560
head(pos.control.peak.with.all)
# V1 V2 V3 V4 V5
#1 chr1 1080556 1080557 GATA_ChIP_peak_45 6.37157
#2 chr1 1238237 1238238 GATA_ChIP_peak_51 16.89380
#3 chr1 1686431 1686432 GATA_ChIP_peak_63 35.04630
#4 chr1 1741130 1741131 GATA_ChIP_peak_68 11.69650
#5 chr1 2002021 2002022 GATA_ChIP_peak_77 7.79887
#6 chr1 3127377 3127378 GATA_ChIP_peak_85 501.39400
closest.1stGAT.to.pos.control.motif.all=bedTools.closest(bed1 = pos.control.peak.with.all[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
write.table(closest.1stGAT.to.pos.control.motif.all,file= 'closest.1stGAT.to.pos.control.motif.all.bed', quote=F,sep="\t",col.names=F,row.names=F)
save.image('240118_GATA3_ChIP_clsestbed5.Rdata')
We are generating a cumulative distribution function to assess the proximity of a 3mer GAT to the summit of GATA3 ChIP peaks, potentially utilizing 3mer GAT as a binding sequence, across five quantile files. This evaluation aims to determine if the 3mer GAT sequences are closer to the summit of GATA3 ChIP peaks than they are to the summit of the MCF7 DHS regions, which consist of four files. Additionally, we possess four positive control peaks containing GATA3 motifs with known structures.
In previous analysis, we have used closestBed to find the closest GAT to peak summit (5 quantile peak sets, 4 MCF7 DHS negative control, and 4 positive controls).
GATA3 peaks
closest.1stGAT.to.GATA.peak.quantile0.2.bed
closest.1stGAT.to.GATA.peak.quantile0.4.bed
closest.1stGAT.to.GATA.peak.quantile0.6.bed
closest.1stGAT.to.GATA.peak.quantile0.8.bed
closest.1stGAT.to.GATA.peak.quantile1.bed
These files saved info of 1st closest GAT to GATA3 peak summits. These GATA3 peaks are peaks without MEME/MOTIF found motifs and have ranked by peak intensity.
Notice that V1 to V3 are peak summit coordinates, V4 to V9 are closestBed reported closest motif to each peak summit. The last column (V11) is the distance from closest motif to peak summit.
closest.1stGAT.to.GATA.peak.quantile1=read.table('closest.1stGAT.to.GATA.peak.quantile1.bed',header=F, comment.char='')
head(closest.1stGAT.to.GATA.peak.quantile1)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 chr1 5598187 5598188 chr1 5598164 5598167 36 36 + GAT 21
## 2 chr1 8017660 8017661 chr1 8017663 8017666 36 36 + GAT 3
## 3 chr1 8020178 8020179 chr1 8020170 8020173 14 14 - GAT 6
## 4 chr1 8055212 8055213 chr1 8055202 8055205 36 36 + GAT 8
## 5 chr1 8061475 8061476 chr1 8061441 8061444 36 36 + GAT 32
## 6 chr1 8120791 8120792 chr1 8120785 8120788 14 14 - GAT 4
closest.1stGAT.to.GATA.peak.quantile0.8=read.table('closest.1stGAT.to.GATA.peak.quantile0.8.bed',header=F, comment.char='')
head(closest.1stGAT.to.GATA.peak.quantile0.8)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 chr1 1073822 1073823 chr1 1073808 1073811 36 36 + GAT 12
## 2 chr1 5637335 5637336 chr1 5637333 5637336 36 36 + GAT 0
## 3 chr1 7019683 7019684 chr1 7019698 7019701 14 14 - GAT 15
## 4 chr1 7494212 7494213 chr1 7494203 7494206 14 14 - GAT 7
## 5 chr1 8026313 8026314 chr1 8026243 8026246 14 14 - GAT 68
## 6 chr1 8070509 8070510 chr1 8070513 8070516 36 36 + GAT 4
closest.1stGAT.to.GATA.peak.quantile0.6=read.table('closest.1stGAT.to.GATA.peak.quantile0.6.bed',header=F, comment.char='')
head(closest.1stGAT.to.GATA.peak.quantile0.6)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 chr1 827380 827381 chr1 827390 827393 14 14 - GAT 10
## 2 chr1 1074184 1074185 chr1 1074157 1074160 14 14 - GAT 25
## 3 chr1 1124966 1124967 chr1 1125026 1125029 36 36 + GAT 60
## 4 chr1 1157747 1157748 chr1 1157663 1157666 14 14 - GAT 82
## 5 chr1 3587455 3587456 chr1 3587417 3587420 14 14 - GAT 36
## 6 chr1 3668981 3668982 chr1 3669053 3669056 14 14 - GAT 72
closest.1stGAT.to.GATA.peak.quantile0.4=read.table('closest.1stGAT.to.GATA.peak.quantile0.4.bed',header=F, comment.char='')
head(closest.1stGAT.to.GATA.peak.quantile0.4)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 chr1 916769 916770 chr1 916719 916722 36 36 + GAT 48
## 2 chr1 1013265 1013266 chr1 1013247 1013250 36 36 + GAT 16
## 3 chr1 1013580 1013581 chr1 1013584 1013587 36 36 + GAT 4
## 4 chr1 1021142 1021143 chr1 1021115 1021118 14 14 - GAT 25
## 5 chr1 1158192 1158193 chr1 1158274 1158277 14 14 - GAT 82
## 6 chr1 1208541 1208542 chr1 1208554 1208557 36 36 + GAT 13
closest.1stGAT.to.GATA.peak.quantile0.2=read.table('closest.1stGAT.to.GATA.peak.quantile0.2.bed',header=F, comment.char='')
head(closest.1stGAT.to.GATA.peak.quantile0.2)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 chr1 924853 924854 chr1 924851 924854 14 14 - GAT 0
## 2 chr1 966653 966654 chr1 966718 966721 36 36 + GAT 65
## 3 chr1 999508 999509 chr1 999569 999572 14 14 - GAT 61
## 4 chr1 1000536 1000537 chr1 1000627 1000630 36 36 + GAT 91
## 5 chr1 1001891 1001892 chr1 1001937 1001940 36 36 + GAT 46
## 6 chr1 1020187 1020188 chr1 1020330 1020333 36 36 + GAT 143
negative control
closest.1stGAT.to.full.DHS.control.consensus.bed
closest.1stGAT.to.indep.DHS.control.consensus.bed
closest.1stGAT.to.full.DHS.control.consensus=read.table('closest.1stGAT.to.full.DHS.control.consensus.bed',header=F, comment.char='')
closest.1stGAT.to.indep.DHS.control.consensus=read.table('closest.1stGAT.to.indep.DHS.control.consensus.bed',header=F, comment.char='')
head(closest.1stGAT.to.full.DHS.control.consensus)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 chr1 268014 268015 chr1 268046 268049 36 36 + GAT 32
## 2 chr1 586135 586136 chr1 586125 586128 36 36 + GAT 8
## 3 chr1 630245 630246 chr1 630266 630269 14 14 - GAT 21
## 4 chr1 630555 630556 chr1 630559 630562 14 14 - GAT 4
## 5 chr1 631455 631456 chr1 631414 631417 14 14 - GAT 39
## 6 chr1 631745 631746 chr1 631779 631782 36 36 + GAT 34
nrow(closest.1stGAT.to.full.DHS.control.consensus)
## [1] 78858
nrow(closest.1stGAT.to.full.DHS.control.consensus[closest.1stGAT.to.full.DHS.control.consensus$V9=="+",])
## [1] 39551
nrow(closest.1stGAT.to.full.DHS.control.consensus[closest.1stGAT.to.full.DHS.control.consensus$V9=="-",])
## [1] 39307
head(closest.1stGAT.to.indep.DHS.control.consensus)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 chr1 268014 268015 chr1 268046 268049 36 36 + GAT 32
## 2 chr1 586135 586136 chr1 586125 586128 36 36 + GAT 8
## 3 chr1 630245 630246 chr1 630266 630269 14 14 - GAT 21
## 4 chr1 630555 630556 chr1 630559 630562 14 14 - GAT 4
## 5 chr1 631455 631456 chr1 631414 631417 14 14 - GAT 39
## 6 chr1 631745 631746 chr1 631779 631782 36 36 + GAT 34
nrow(closest.1stGAT.to.indep.DHS.control.consensus)
## [1] 57906
nrow(closest.1stGAT.to.indep.DHS.control.consensus[closest.1stGAT.to.indep.DHS.control.consensus$V9=="+",])
## [1] 29134
nrow(closest.1stGAT.to.indep.DHS.control.consensus[closest.1stGAT.to.indep.DHS.control.consensus$V9=="-",])
## [1] 28772
positive control
While plotting the closest GAT to peak summit, we can use all peaks that contains GATA3-like motifs (found by MEME/STREME) as the positive control. Thus the latter CDF plot will only show one trace of pos control which will make the plot more concise:
closest.1stGAT.to.pos.control.motif.all=read.table('closest.1stGAT.to.pos.control.motif.all.bed',header=F, comment.char='')
head(closest.1stGAT.to.pos.control.motif.all)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 chr1 845796 845797 chr1 845781 845784 36 36 + GAT 13
## 2 chr1 869497 869498 chr1 869497 869500 14 14 - GAT 0
## 3 chr1 917534 917535 chr1 917525 917528 14 14 - GAT 7
## 4 chr1 996149 996150 chr1 996149 996152 36 36 + GAT 0
## 5 chr1 1080556 1080557 chr1 1080566 1080569 36 36 + GAT 10
## 6 chr1 1238237 1238238 chr1 1238225 1238228 14 14 - GAT 10
nrow(closest.1stGAT.to.pos.control.motif.all)
## [1] 59560
nrow(closest.1stGAT.to.pos.control.motif.all[closest.1stGAT.to.pos.control.motif.all$V9=="+",])
## [1] 30122
nrow(closest.1stGAT.to.pos.control.motif.all[closest.1stGAT.to.pos.control.motif.all$V9=="-",])
## [1] 29438
GATA3 peak info + distance + status.
df.chip = data.frame(matrix(nrow = 0, ncol = 5))
colnames(df.chip) = c("V1","V2","v3", "dis", "status")
for (chip.peak in Sys.glob(file.path("./closest.1stGAT.to.GATA.peak.quantile*.bed"))) {
print(chip.peak)
quantile.name =strsplit((strsplit(strsplit(chip.peak, "/")[[1]][length(strsplit(chip.peak, "/")[[1]])], 'closest.1stGAT.to.GATA.peak.')[[1]][2]), ".bed")[[1]][1]
print(quantile.name)
all.distance = cbind(read.table(chip.peak,header=F, comment.char='')[,c(1:3, 11)], quantile.name)
df.chip = rbind(df.chip, all.distance)
}
## [1] "./closest.1stGAT.to.GATA.peak.quantile0.2.bed"
## [1] "quantile0.2"
## [1] "./closest.1stGAT.to.GATA.peak.quantile0.4.bed"
## [1] "quantile0.4"
## [1] "./closest.1stGAT.to.GATA.peak.quantile0.6.bed"
## [1] "quantile0.6"
## [1] "./closest.1stGAT.to.GATA.peak.quantile0.8.bed"
## [1] "quantile0.8"
## [1] "./closest.1stGAT.to.GATA.peak.quantile1.bed"
## [1] "quantile1"
str(df.chip)
## 'data.frame': 37115 obs. of 5 variables:
## $ V1 : chr "chr1" "chr1" "chr1" "chr1" ...
## $ V2 : int 924853 966653 999508 1000536 1001891 1020187 1069549 1079599 1163013 1163246 ...
## $ V3 : int 924854 966654 999509 1000537 1001892 1020188 1069550 1079600 1163014 1163247 ...
## $ V11 : int 0 65 61 91 46 143 37 29 213 120 ...
## $ quantile.name: chr "quantile0.2" "quantile0.2" "quantile0.2" "quantile0.2" ...
unique(df.chip$quantile.name)
## [1] "quantile0.2" "quantile0.4" "quantile0.6" "quantile0.8" "quantile1"
colnames(df.chip) = c("V1","V2","V3", "dis", "status")
head(df.chip)
## V1 V2 V3 dis status
## 1 chr1 924853 924854 0 quantile0.2
## 2 chr1 966653 966654 65 quantile0.2
## 3 chr1 999508 999509 61 quantile0.2
## 4 chr1 1000536 1000537 91 quantile0.2
## 5 chr1 1001891 1001892 46 quantile0.2
## 6 chr1 1020187 1020188 143 quantile0.2
negative control info + distance +status.
df.neg.consensus = data.frame(matrix(nrow = 0, ncol = 5))
colnames(df.neg.consensus) = c("V1","V2","v3", "dis", "status")
for (chip.peak in Sys.glob(file.path("./closest.1stGAT.to.*DHS.control.consensus.bed"))) {
print(chip.peak)
rep.name =strsplit((strsplit(strsplit(chip.peak, "/")[[1]][length(strsplit(chip.peak, "/")[[1]])], 'closest.1stGAT.to.')[[1]][2]), ".consensus.bed")[[1]][1]
print(rep.name)
all.distance = cbind(read.table(chip.peak,header=F, comment.char='')[,c(1:3, 11)], rep.name)
df.neg.consensus = rbind(df.neg.consensus, all.distance)
}
## [1] "./closest.1stGAT.to.full.DHS.control.consensus.bed"
## [1] "full.DHS.control"
## [1] "./closest.1stGAT.to.indep.DHS.control.consensus.bed"
## [1] "indep.DHS.control"
str(df.neg.consensus)
## 'data.frame': 136764 obs. of 5 variables:
## $ V1 : chr "chr1" "chr1" "chr1" "chr1" ...
## $ V2 : int 268014 586135 630245 630555 631455 631745 632285 633015 634495 634735 ...
## $ V3 : int 268015 586136 630246 630556 631456 631746 632286 633016 634496 634736 ...
## $ V11 : int 32 8 21 4 39 34 22 11 18 0 ...
## $ rep.name: chr "full.DHS.control" "full.DHS.control" "full.DHS.control" "full.DHS.control" ...
unique(df.neg.consensus$rep.name)
## [1] "full.DHS.control" "indep.DHS.control"
colnames(df.neg.consensus) = c("V1","V2","V3", "dis", "status")
head(df.neg.consensus)
## V1 V2 V3 dis status
## 1 chr1 268014 268015 32 full.DHS.control
## 2 chr1 586135 586136 8 full.DHS.control
## 3 chr1 630245 630246 21 full.DHS.control
## 4 chr1 630555 630556 4 full.DHS.control
## 5 chr1 631455 631456 39 full.DHS.control
## 6 chr1 631745 631746 34 full.DHS.control
positive control info + distance +status.
df.pos=read.table('closest.1stGAT.to.pos.control.motif.all.bed',header=F, comment.char='')[,c(1:3, 11)]
df.pos$status="pos.control.motif.all"
colnames(df.pos) = c("V1","V2","V3", "dis", "status")
str(df.pos)
## 'data.frame': 59560 obs. of 5 variables:
## $ V1 : chr "chr1" "chr1" "chr1" "chr1" ...
## $ V2 : int 845796 869497 917534 996149 1080556 1238237 1505021 1686431 1741130 1746947 ...
## $ V3 : int 845797 869498 917535 996150 1080557 1238238 1505022 1686432 1741131 1746948 ...
## $ dis : int 13 0 7 0 10 10 28 0 0 1 ...
## $ status: chr "pos.control.motif.all" "pos.control.motif.all" "pos.control.motif.all" "pos.control.motif.all" ...
unique(df.pos$status)
## [1] "pos.control.motif.all"
head(df.pos)
## V1 V2 V3 dis status
## 1 chr1 845796 845797 13 pos.control.motif.all
## 2 chr1 869497 869498 0 pos.control.motif.all
## 3 chr1 917534 917535 7 pos.control.motif.all
## 4 chr1 996149 996150 0 pos.control.motif.all
## 5 chr1 1080556 1080557 10 pos.control.motif.all
## 6 chr1 1238237 1238238 10 pos.control.motif.all
merge files.
df.all=rbind(df.chip, df.neg.consensus, df.pos)
unique(df.all$status)
## [1] "quantile0.2" "quantile0.4" "quantile0.6"
## [4] "quantile0.8" "quantile1" "full.DHS.control"
## [7] "indep.DHS.control" "pos.control.motif.all"
df.all$status = factor(df.all$status, levels = c("quantile0.2", "quantile0.4", "quantile0.6", "quantile0.8", "quantile1", "full.DHS.control", "indep.DHS.control", "pos.control.motif.all"))
head(df.all)
## V1 V2 V3 dis status
## 1 chr1 924853 924854 0 quantile0.2
## 2 chr1 966653 966654 65 quantile0.2
## 3 chr1 999508 999509 61 quantile0.2
## 4 chr1 1000536 1000537 91 quantile0.2
## 5 chr1 1001891 1001892 46 quantile0.2
## 6 chr1 1020187 1020188 143 quantile0.2
nrow(df.all)
## [1] 233439
all in one plot
library(lattice)
library(latticeExtra)
ecdfplot(~log(abs(dis), base = 10), groups = status, data = df.all,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c(colorRampPalette(c("pink","red"))(4), "#ce228e", "grey","grey30", "green"),
aspect = 1,
#xlim = c(0, 50000),
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = expression('log'[10]~'3mer-GAT Distance from peak summit'),
#index.cond = list(c(2,1)),
between=list(y=1.0),
type = 'a',
xlim = c(0,2.5),
lwd=2,
lty=c(1),
par.settings = list(superpose.line = list(col=c(colorRampPalette(c("pink","red"))(4), "#ce228e", "grey","grey30", "green"), lwd=3), strip.background=list(col="grey85"))
#panel = function(...) {
# panel.abline(v= 1.255, lty =2) #log10(17)=1.23 #log10(17)=1.255
# panel.ecdfplot(...)
#}
)
The above cumulative distribution function (PDF) plot is showing the distance between 3mer-GAT to the summit of GATA3 ChIP peak (red), or summit of the ctrl peak set (grey: MCF7 DHS regions; green: positive control).
For positive control (green) vs. negative control (grey), positive control shows a significant left shift which indicates that GAT is found significantly closer to GATA3 peaks with MEME/STREME found GATA3-like motifs.
For GATA3 peaks without MEME/STREME found GATA3-like motifs (5 red traces), they also show leftward shift compared to negative controls (grey). Notice that peaks with higher intensity also shows greater left-shift.
ecdfplot(~log(abs(dis), base = 10), groups = status, data = df.neg.consensus,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c("grey","grey30"),
aspect = 1,
#xlim = c(0, 50000),
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = expression('log'[10]~'3mer-GAT Distance from peak summit'),
#index.cond = list(c(2,1)),
between=list(y=1.0),
type = 'a',
xlim = c(0,2.5),
lwd=2,
lty=c(1),
par.settings = list(superpose.line = list(col=c("grey","grey30"), lwd=3), strip.background=list(col="grey85"))
#panel = function(...) {
# panel.abline(v= , lty =2)
# panel.ecdfplot(...)
#}
)
# Specify categories
match = ecdf(abs(df.neg.consensus$dis)[df.neg.consensus$status == 'indep.DHS.control'])
rep = ecdf(abs(df.neg.consensus$dis)[df.neg.consensus$status == 'full.DHS.control'])
match.y = seq(0, 1450, by=1) # creating indices
rep.y = seq(0, 1450, by=1)
spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)
print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1])
## [1] 9
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,60),
cex=0.7,
xlab = 'Distance indices (abs) between closest 3mer-GAT to peak summit (bp)',
ylab = 'full.DHS.control CDF - indep.DHS.control CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,300),
cex=0.7,
xlab = 'Distance indices (abs) between closest 3mer-GAT to peak summit (bp)',
ylab = 'full.DHS.control CDF - indep.DHS.control CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
df1= rbind(df.neg.consensus, df.pos)
df1$status = factor(df1$status, levels = c("full.DHS.control", "indep.DHS.control", "pos.control.motif.all"))
ecdfplot(~log(abs(dis), base = 10), groups = status, data = df1,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c("grey","grey30", "green"),
aspect = 1,
#xlim = c(0, 50000),
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = expression('log'[10]~'3mer-GAT Distance from peak summit'),
#index.cond = list(c(2,1)),
between=list(y=1.0),
type = 'a',
xlim = c(0,2.5),
lwd=2,
lty=c(1),
par.settings = list(superpose.line = list(col=c("grey","grey30", "green"), lwd=3), strip.background=list(col="grey85")),
panel = function(...) {
panel.abline(v= 1.204, lty =2) #log10(16)=1.204
panel.ecdfplot(...)
}
)
A coherence check on the positive control vs. the two negative controls. We can observe a remarkble left-shift in the cdf plot, indicates that GAT is closer to peak summit in GATA3 peaks.
pos vs. full.neg
# Specify categories
match = ecdf(abs(df1$dis)[df1$status == 'full.DHS.control'])
rep = ecdf(abs(df1$dis)[df1$status == 'pos.control.motif.all'])
match.y = seq(0, 1400, by=1) # creating indices
rep.y = seq(0, 1400, by=1)
spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)
print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1])
## [1] 16
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,60),
cex=0.7,
xlab = 'Distance indices (abs) between closest 3mer-GAT to peak summit (bp)',
ylab = 'pos.control.motif.all CDF - negative ctrl(full.DHS.control) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,300),
cex=0.7,
xlab = 'Distance indices (abs) between closest 3mer-GAT to peak summit (bp)',
ylab = 'pos.control.motif.all CDF - negative ctrl(full.DHS.control) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
pos vs. indep.neg
# Specify categories
match = ecdf(abs(df1$dis)[df1$status == 'indep.DHS.control'])
rep = ecdf(abs(df1$dis)[df1$status == 'pos.control.motif.all'])
match.y = seq(0, 1450, by=1) # creating indices
rep.y = seq(0, 1450, by=1)
spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)
print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1])
## [1] 16
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,60),
cex=0.7,
xlab = 'Distance indices (abs) between closest 3mer-GAT to peak summit (bp)',
ylab = 'pos.control.motif.all CDF - negative ctrl(indep.DHS.control) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,300),
cex=0.7,
xlab = 'Distance indices (abs) between closest 3mer-GAT to peak summit (bp)',
ylab = 'pos.control.motif.all CDF - negative ctrl(indep.DHS.control) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
The CDF differences between positive control and negative control reach to plateau at absolute distance==16bp. This is also saying that GAT tend to be “not randomly” found within 16bp upstream/downstream window of GATA3 peak summit. Beyond 16bp window, we have less confidence saying that GAT is significantly enriched in GATA3 peaks.
We can determine the fraction of peaks with closest GAT:
frac1=sum(df1[df1$status=="pos.control.motif.all",]$dis <= 16) / length(df1[df1$status=="pos.control.motif.all",]$dis)
frac2=sum(df.neg.consensus[df.neg.consensus$status=="full.DHS.control",]$dis <= 16) / length(df.neg.consensus[df.neg.consensus$status=="full.DHS.control",]$dis)
frac3=sum(df.neg.consensus[df.neg.consensus$status=="indep.DHS.control",]$dis <= 16) / length(df.neg.consensus[df.neg.consensus$status=="indep.DHS.control",]$dis)
frac1-frac2
## [1] 0.3696416
frac1-frac3
## [1] 0.3722632
The above code same as:
ecdf(df1[df1$status=="pos.control.motif.all",]$dis)(16)
## [1] 0.8063633
We can empirically determine the distance at which the CDFs’ difference between GATA3 peak set and ctrl peak set reach the plateaus (where CDF traces became parallel or converging).
unique(df.all$status)
## [1] quantile0.2 quantile0.4 quantile0.6
## [4] quantile0.8 quantile1 full.DHS.control
## [7] indep.DHS.control pos.control.motif.all
## 8 Levels: quantile0.2 quantile0.4 quantile0.6 quantile0.8 ... pos.control.motif.all
Comparing between MCF7 DHS regions (neg control) and GATA3 peak in quantile1 to determine the converging/parallel distance cutoff at ~13bp.
1) quantile1 vs. “full.DHS.control”
# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'full.DHS.control'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile1'])
match.y = seq(0, 1400, by=1) # creating indices
rep.y = seq(0, 1400, by=1)
spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)
print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1])
## [1] 17
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,60),
cex=0.7,
xlab = 'Distance indices (abs) between closest 3mer-GAT to peak summit (bp)',
ylab = 'GATA3 peak (quantile1) CDF - negative ctrl(full.DHS.control) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,300),
cex=0.7,
xlab = 'Distance indices (abs) between closest 3mer-GAT to peak summit (bp)',
ylab = 'GATA3 peak (quantile1) CDF - negative ctrl(full.DHS.control) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
2) quantile1 vs. “indep.DHS.control”
# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'indep.DHS.control'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile1'])
match.y = seq(0, 1400, by=1) # creating indices
rep.y = seq(0, 1400, by=1)
spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)
print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1])
## [1] 18
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,60),
cex=0.7,
xlab = 'Distance indices (abs) between closest 3mer-GAT to peak summit (bp)',
ylab = 'GATA3 peak (quantile1) CDF - negative ctrl(indep.DHS.control) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,300),
cex=0.7,
xlab = 'Distance indices (abs) between closest 3mer-GAT to peak summit (bp)',
ylab = 'GATA3 peak (quantile1) CDF - negative ctrl(indep.DHS.control) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
By plotting the difference between the CDFs of GATA3 peaks (quantile 1) and the negative control, we can pinpoint the point where these two CDF curves converge or reach a similar trajectory.
When comparing GATA3 peaks within quantile 1 to the full.DHS.control (MCF7 cells), we observe that at a distance of 17 from the GATA3 peak summit, there are peaks exhibiting a closer resemblance to the GAT 3mer.
A slightly larger distance cut-off (dis==18bp) was found when comparing with the indep.DHS.control. This is expected because we already know that indep neg control are more towards right compared to full neg control.
library(lattice)
library(latticeExtra)
df1=rbind(df.chip[df.chip$status=="quantile1",], df.neg.consensus)
df1$status = factor(df1$status, levels = c("quantile1", "full.DHS.control", "indep.DHS.control"))
ecdfplot(~abs(dis), groups = status, data = df1,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c("#ce228e", "blue4","blue"),
aspect = 1,
#xlim = c(0, 50000),
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = expression('3mer-GAT Distance (absolute) from peak summit'),
#index.cond = list(c(2,1)),
between=list(y=1.0),
type = 'a',
xlim = c(0,60),
lwd=2,
lty=c(1),
par.settings = list(superpose.line = list(col=c("#ce228e", "blue4","blue"), lwd=3), strip.background=list(col="grey85")),
panel = function(...) {
panel.abline(v= 18, lty =2, col="red")
#panel.abline(v= 18, lty =2, col="grey30")
panel.ecdfplot(...)
})
ecdfplot(~log(abs(dis), base = 10), groups = status, data = df1,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c("#ce228e", "blue4","blue"),
aspect = 1,
#xlim = c(0, 50000),
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = expression('log'[10]~'3mer-GAT Distance (absolute) from peak summit'),
#index.cond = list(c(2,1)),
between=list(y=1.0),
type = 'a',
xlim = c(0,2.5),
lwd=2,
lty=c(1),
par.settings = list(superpose.line = list(col=c("#ce228e", "blue4","blue"), lwd=3), strip.background=list(col="grey85")),
panel = function(...) {
panel.abline(v= 1.255, lty =2, col="red")
#panel.abline(v= 18, lty =2, col="grey30")
panel.ecdfplot(...)
})
In the above CDF plot, a noticeable left-shift is evident in the closest GAT distance to GATA3 peaks within quantile 1 compared with the negative control. At the marked red dotted line positioned at distance 18, we empirically identify the point where the positive trace converge with the negative traces.
We can determine the fraction of peaks with closest GAT (19%):
frac1=sum(df.chip[df.chip$status=="quantile1",]$dis <= 17) / length(df.chip[df.chip$status=="quantile1",]$dis)
frac2=sum(df.neg.consensus[df.neg.consensus$status=="full.DHS.control",]$dis <= 17) / length(df.neg.consensus[df.neg.consensus$status=="full.DHS.control",]$dis)
frac1-frac2
## [1] 0.1896786
frac3=sum(df.chip[df.chip$status=="quantile1",]$dis <= 18) / length(df.chip[df.chip$status=="quantile1",]$dis)
frac4=sum(df.neg.consensus[df.neg.consensus$status=="indep.DHS.control",]$dis <= 18) / length(df.neg.consensus[df.neg.consensus$status=="indep.DHS.control",]$dis)
frac3-frac4
## [1] 0.1918092
Here, indep.DHS neg control serves as a good internal control for the full.DHS neg control. Since we know full.DHS contains some regions overlapped with GATA3 ChIP peaks, we would expect to see less regions been identified as regions closer to GAT (because some regions are cancelled out). The real fraction should be 19.18%.
1) quantile0.8 vs. “full.DHS.control”:
# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'full.DHS.control'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile0.8'])
match.y = seq(0, 1400, by=1) # creating indices
rep.y = seq(0, 1400, by=1)
spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)
print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1])
## [1] 9
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,60),
cex=0.7,
xlab = 'Distance indices (abs) between closest 3mer-GAT to peak summit (bp)',
ylab = 'GATA3 peak (quantile0.8) CDF - negative ctrl(full.DHS.control) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,300),
cex=0.7,
xlab = 'Distance indices (abs) between closest 3mer-GAT to peak summit (bp)',
ylab = 'GATA3 peak (quantile0.8) CDF - negative ctrl(full.DHS.control) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
2) quantile0.8 vs. “indep.DHS.control”:
# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'indep.DHS.control'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile0.8'])
match.y = seq(0, 1400, by=1) # creating indices
rep.y = seq(0, 1400, by=1)
spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)
print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1])
## [1] 9
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,60),
cex=0.7,
xlab = 'Distance indices (abs) between closest 3mer-GAT to peak summit (bp)',
ylab = 'GATA3 peak (quantile0.8) CDF - negative ctrl(indep.DHS.control) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,300),
cex=0.7,
xlab = 'Distance indices (abs) between closest 3mer-GAT to peak summit (bp)',
ylab = 'GATA3 peak (quantile0.8) CDF - negative ctrl(indep.DHS.control) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
When comparing GATA3 peaks within quantile 0.8 to the two DHS control (MCF7 cells), we observe that at a distance of 9 from the GATA3 peak summit, there are peaks exhibiting a closer resemblance to the GAT 3mer.
library(lattice)
library(latticeExtra)
df1=rbind(df.chip[df.chip$status=="quantile0.8",], df.neg.consensus)
df1$status = factor(df1$status, levels = c("quantile0.8", "full.DHS.control", "indep.DHS.control"))
ecdfplot(~abs(dis), groups = status, data = df1,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c("red", "blue4","blue"),
aspect = 1,
#xlim = c(0, 50000),
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = expression('3mer-GAT Distance from peak summit'),
#index.cond = list(c(2,1)),
between=list(y=1.0),
type = 'a',
xlim = c(0,60),
lwd=2,
lty=c(1),
par.settings = list(superpose.line = list(col=c("red", "blue4", "blue"), lwd=3), strip.background=list(col="grey85")),
panel = function(...) {
panel.abline(v= 9, lty =2, col="red")
#panel.abline(v= 18, lty =2, col="grey30")
panel.ecdfplot(...)
})
ecdfplot(~log(abs(dis), base = 10), groups = status, data = df1,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c("red", "blue4","blue"),
aspect = 1,
#xlim = c(0, 50000),
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = expression('log'[10]~'3mer-GAT Distance from peak summit'),
#index.cond = list(c(2,1)),
between=list(y=1.0),
type = 'a',
xlim = c(0,2.5),
lwd=2,
lty=c(1),
par.settings = list(superpose.line = list(col=c("red", "blue4", "blue"), lwd=3), strip.background=list(col="grey85")),
panel = function(...) {
panel.abline(v= 0.954, lty =2, col="red")
#panel.abline(v= 18, lty =2, col="grey30")
panel.ecdfplot(...)
})
In the above CDF plot, a noticeable left-shift is evident in the closest GAT distance to GATA3 peaks within quantile 0.8 compared with the negative control. At the marked red dotted line positioned at distance 9, we empirically identify the point where the two traces align in parallel.
We can determine the fraction of peaks with closest GAT (6.9%):
frac1=sum(df.chip[df.chip$status=="quantile0.8",]$dis <= 9) / length(df.chip[df.chip$status=="quantile0.8",]$dis)
frac2=sum(df.neg.consensus[df.neg.consensus$status=="full.DHS.control",]$dis <= 9) / length(df.neg.consensus[df.neg.consensus$status=="full.DHS.control",]$dis)
frac3=sum(df.neg.consensus[df.neg.consensus$status=="indep.DHS.control",]$dis <= 9) / length(df.neg.consensus[df.neg.consensus$status=="indep.DHS.control",]$dis)
frac1-frac2
## [1] 0.06649377
frac1-frac3
## [1] 0.06867757
Very similar fraction of peak considerred to have a closer GAT. Again, it alligns with our expectations that differences between indep and GATA3 peaks will be more than full set.
1) quantile0.6 vs. “full.DHS.control”:
# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'full.DHS.control'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile0.6'])
match.y = seq(0, 1400, by=1) # creating indices
rep.y = seq(0, 1400, by=1)
spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)
print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1])
## [1] 10
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,60),
cex=0.7,
xlab = 'Distance indices (abs) between closest 3mer-GAT to peak summit (bp)',
ylab = 'GATA3 peak (quantile0.6) CDF - negative ctrl(full.DHS.control) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,300),
cex=0.7,
xlab = 'Distance indices (abs) between closest 3mer-GAT to peak summit (bp)',
ylab = 'GATA3 peak (quantile0.6) CDF - negative ctrl(full.DHS.control) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
When comparing GATA3 peaks within quantile 0.6 to the full.DHS.control (MCF7 cells), we observe that at a distance of 10 from the GATA3 peak summit, there are peaks exhibiting a closer resemblance to the GAT 3mer.
2) quantile0.6 vs. “indep.DHS.control”:
# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'indep.DHS.control'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile0.6'])
match.y = seq(0, 1400, by=1) # creating indices
rep.y = seq(0, 1400, by=1)
spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)
print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1])
## [1] 10
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,60),
cex=0.7,
xlab = 'Distance indices (abs) between closest 3mer-GAT to peak summit (bp)',
ylab = 'GATA3 peak (quantile0.6) CDF - negative ctrl(indep.DHS.control) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,300),
cex=0.7,
xlab = 'Distance indices (abs) between closest 3mer-GAT to peak summit (bp)',
ylab = 'GATA3 peak (quantile0.6) CDF - negative ctrl(indep.DHS.control) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
When comparing GATA3 peaks within quantile 0.6 to the indep.DHS.control (MCF7 cells), we observe that at a distance of 10 from the GATA3 peak summit, there are peaks exhibiting a closer resemblance to the GAT 3mer.
library(lattice)
library(latticeExtra)
df1=rbind(df.chip[df.chip$status=="quantile0.6",], df.neg.consensus)
df1$status = factor(df1$status, levels = c("quantile0.6", "full.DHS.control", "indep.DHS.control"))
ecdfplot(~abs(dis), groups = status, data = df1,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c("red", "blue4","blue"),
aspect = 1,
#xlim = c(0, 50000),
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = expression('3mer-GAT Distance from peak summit'),
#index.cond = list(c(2,1)),
between=list(y=1.0),
type = 'a',
xlim = c(0,60),
lwd=2,
lty=c(1),
par.settings = list(superpose.line = list(col=c("red", "blue4","blue"), lwd=3), strip.background=list(col="grey85")),
panel = function(...) {
panel.abline(v= 10, lty =2, col="red")
#panel.abline(v= 18, lty =2, col="grey30")
panel.ecdfplot(...)
})
ecdfplot(~log(abs(dis), base = 10), groups = status, data = df1,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c("red", "blue4","blue"),
aspect = 1,
#xlim = c(0, 50000),
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = expression('log'[10]~'3mer-GAT Distance from peak summit'),
#index.cond = list(c(2,1)),
between=list(y=1.0),
type = 'a',
xlim = c(0,2.5),
lwd=2,
lty=c(1),
par.settings = list(superpose.line = list(col=c("red", "blue4","blue"), lwd=3), strip.background=list(col="grey85")),
panel = function(...) {
panel.abline(v= 1, lty =2, col="red")
#panel.abline(v= 18, lty =2, col="grey30")
panel.ecdfplot(...)
})
In the above CDF plot, a noticeable left-shift is evident in the closest GAT distance to GATA3 peaks within quantile 0.6 compared with the negative control. At the marked red dotted line positioned at distance 10, we empirically identify the point where the two traces align in parallel.
We can determine the fraction of peaks with closest GAT (4.9%):
frac1=sum(df.chip[df.chip$status=="quantile0.6",]$dis <= 10) / length(df.chip[df.chip$status=="quantile0.6",]$dis)
frac2=sum(df.neg.consensus[df.neg.consensus$status=="full.DHS.control",]$dis <= 10) / length(df.neg.consensus[df.neg.consensus$status=="full.DHS.control",]$dis)
frac3=sum(df.neg.consensus[df.neg.consensus$status=="indep.DHS.control",]$dis <= 10) / length(df.neg.consensus[df.neg.consensus$status=="indep.DHS.control",]$dis)
frac1-frac2
## [1] 0.04716174
frac1-frac3
## [1] 0.04948643
1) quantile0.4 vs. “full.DHS.control”:
# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'full.DHS.control'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile0.4'])
match.y = seq(0, 1400, by=1) # creating indices
rep.y = seq(0, 1400, by=1)
spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)
print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1])
## [1] 5
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,60),
cex=0.7,
xlab = 'Distance indices (abs) between closest 3mer-GAT to peak summit (bp)',
ylab = 'GATA3 peak (quantile0.4) CDF - negative ctrl(full.DHS.control) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,300),
cex=0.7,
xlab = 'Distance indices (abs) between closest 3mer-GAT to peak summit (bp)',
ylab = 'GATA3 peak (quantile0.4) CDF - negative ctrl(full.DHS.control) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
2) quantile0.4 vs. “indep.DHS.control”:
# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'indep.DHS.control'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile0.4'])
match.y = seq(0, 1400, by=1) # creating indices
rep.y = seq(0, 1400, by=1)
spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)
print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1])
## [1] 5
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,60),
cex=0.7,
xlab = 'Distance indices (abs) between closest 3mer-GAT to peak summit (bp)',
ylab = 'GATA3 peak (quantile0.4) CDF - negative ctrl(indep.DHS.control) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,300),
cex=0.7,
xlab = 'Distance indices (abs) between closest 3mer-GAT to peak summit (bp)',
ylab = 'GATA3 peak (quantile0.4) CDF - negative ctrl(indep.DHS.control) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
When comparing GATA3 peaks within quantile 0.4 to the DHS negative controls (MCF7 cells), we observe that at a distance of 5 from the GATA3 peak summit, there are peaks exhibiting a closer resemblance to the GAT 3mer.
library(lattice)
library(latticeExtra)
df1=rbind(df.chip[df.chip$status=="quantile0.4",], df.neg.consensus)
df1$status = factor(df1$status, levels = c("quantile0.4", "full.DHS.control", "indep.DHS.control"))
ecdfplot(~abs(dis), groups = status, data = df1,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c("red", "blue4","blue"),
aspect = 1,
#xlim = c(0, 50000),
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = expression('3mer-GAT Distance from peak summit'),
#index.cond = list(c(2,1)),
between=list(y=1.0),
type = 'a',
xlim = c(0,60),
lwd=2,
lty=c(1),
par.settings = list(superpose.line = list(col=c("red", "blue4","blue"), lwd=3), strip.background=list(col="grey85")),
panel = function(...) {
panel.abline(v= 5, lty =2, col="red")
#panel.abline(v= 18, lty =2, col="grey30")
panel.ecdfplot(...)
})
ecdfplot(~log(abs(dis), base = 10), groups = status, data = df1,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c("red", "blue4","blue"),
aspect = 1,
#xlim = c(0, 50000),
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = expression('log'[10]~'3mer-GAT Distance from peak summit'),
#index.cond = list(c(2,1)),
between=list(y=1.0),
type = 'a',
xlim = c(0,2.5),
lwd=2,
lty=c(1),
par.settings = list(superpose.line = list(col=c("red", "blue4","blue"), lwd=3), strip.background=list(col="grey85")),
panel = function(...) {
panel.abline(v= 0.699, lty =2, col="red")
#panel.abline(v= 18, lty =2, col="grey30")
panel.ecdfplot(...)
})
In the above CDF plot, the left-shift is not as prominent as previous few cdf plots. At the marked red dotted line positioned at distance 4, we empirically identify the point where the two traces align in parallel.
We can determine the fraction of peaks with closest GAT (2.5%):
frac1=sum(df.chip[df.chip$status=="quantile0.4",]$dis <= 5) / length(df.chip[df.chip$status=="quantile0.4",]$dis)
frac2=sum(df.neg.consensus[df.neg.consensus$status=="full.DHS.control",]$dis <= 5) / length(df.neg.consensus[df.neg.consensus$status=="full.DHS.control",]$dis)
frac3=sum(df.neg.consensus[df.neg.consensus$status=="indep.DHS.control",]$dis <= 5) / length(df.neg.consensus[df.neg.consensus$status=="indep.DHS.control",]$dis)
frac1-frac2
## [1] 0.02287906
frac1-frac3
## [1] 0.02469882
1) quantile0.2 vs. “full.DHS.control”:
# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'full.DHS.control'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile0.2'])
match.y = seq(0, 1400, by=1) # creating indices
rep.y = seq(0, 1400, by=1)
spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)
print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1])
## [1] 8
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,60),
cex=0.7,
xlab = 'Distance indices (abs) between closest 3mer-GAT to peak summit (bp)',
ylab = 'GATA3 peak (quantile0.2) CDF - negative ctrl(full.DHS.control) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,300),
cex=0.7,
xlab = 'Distance indices (abs) between closest 3mer-GAT to peak summit (bp)',
ylab = 'GATA3 peak (quantile0.2) CDF - negative ctrl(full.DHS.control) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
2) quantile0.2 vs. “indep.DHS.control”:
# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'indep.DHS.control'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile0.2'])
match.y = seq(0, 1400, by=1) # creating indices
rep.y = seq(0, 1400, by=1)
spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)
print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1])
## [1] 8
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,60),
cex=0.7,
xlab = 'Distance indices (abs) between closest 3mer-GAT to peak summit (bp)',
ylab = 'GATA3 peak (quantile0.2) CDF - negative ctrl(indep.DHS.control) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,300),
cex=0.7,
xlab = 'Distance indices (abs) between closest 3mer-GAT to peak summit (bp)',
ylab = 'GATA3 peak (quantile0.2) CDF - negative ctrl(indep.DHS.control) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
When comparing GATA3 peaks within quantile 0.2 to the DHS negative controls (MCF7 cells), we observe that at a distance of 8 from the GATA3 peak summit, there are peaks exhibiting a closer resemblance to the GAT 3mer.
library(lattice)
library(latticeExtra)
df1=rbind(df.chip[df.chip$status=="quantile0.2",], df.neg.consensus)
df1$status = factor(df1$status, levels = c("quantile0.2", "full.DHS.control", "indep.DHS.control"))
ecdfplot(~abs(dis), groups = status, data = df1,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c("red", "blue4","blue"),
aspect = 1,
#xlim = c(0, 50000),
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = expression('3mer-GAT Distance from peak summit'),
#index.cond = list(c(2,1)),
between=list(y=1.0),
type = 'a',
xlim = c(0,60),
lwd=2,
lty=c(1),
par.settings = list(superpose.line = list(col=c("red", "blue4","blue"), lwd=3), strip.background=list(col="grey85")),
panel = function(...) {
panel.abline(v= 8, lty =2, col="red")
#panel.abline(v= 18, lty =2, col="grey30")
panel.ecdfplot(...)
})
ecdfplot(~log(abs(dis), base = 10), groups = status, data = df1,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c("red", "blue4","blue"),
aspect = 1,
#xlim = c(0, 50000),
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = expression('log'[10]~'3mer-GAT Distance from peak summit'),
#index.cond = list(c(2,1)),
between=list(y=1.0),
type = 'a',
xlim = c(0,2.5),
lwd=2,
lty=c(1),
par.settings = list(superpose.line = list(col=c("red", "blue4","blue"), lwd=3), strip.background=list(col="grey85")),
panel = function(...) {
panel.abline(v= 0.9, lty =2, col="red")
#panel.abline(v= 18, lty =2, col="grey30")
panel.ecdfplot(...)
})
In the above CDF plot, left-shift is not prominent. At the marked red dotted line positioned at distance 8, we empirically identify the point where the two traces align in parallel.
We can determine the fraction of peaks with closest GAT (1.7%):
frac1=sum(df.chip[df.chip$status=="quantile0.2",]$dis <= 8) / length(df.chip[df.chip$status=="quantile0.2",]$dis)
frac2=sum(df.neg.consensus[df.neg.consensus$status=="full.DHS.control",]$dis <= 8) / length(df.neg.consensus[df.neg.consensus$status=="full.DHS.control",]$dis)
frac3=sum(df.neg.consensus[df.neg.consensus$status=="indep.DHS.control",]$dis <= 8) / length(df.neg.consensus[df.neg.consensus$status=="indep.DHS.control",]$dis)
frac1-frac2
## [1] 0.01517782
frac1-frac3
## [1] 0.01722397
Summarize the fraction of peaks (in each quantile) that has a closest GAT to its peak summits.
#module load R/4.1.2
#R
library(lattice)
bar=read.csv(file ="barchart_sum.csv", header = T)
str(bar)
bar$lib=factor(bar$lib, levels=c("quantile1","quantile0.8","quantile0.6","quantile0.4","quantile0.2",))
pdf('240119_fraction_of_peaks_have_closer_GAT.pdf', width=10,height=8)
my.settings <- list(
#superpose.polygon=list(col=c(colorRampPalette(c("red","pink"))(6),colorRampPalette(c("blue","light blue"))(3), "light grey"), border="transparent"),
superpose.polygon=list(col=c(colorRampPalette(c("pink","red"))(4), "#ce228e"), border="transparent"),
strip.background=list(col="grey80", cex = 0.6),
strip.border=list(col="black")
)
print(barchart(fraction ~ lib | negctrl,
data = bar,
groups = lib,
stack = TRUE,
ylim=c(0,0.25),
auto.key=list(space="right"),
scales = list(x = list(rot = 45)),
ylab = "fraction of peak set with closer GAT to summit",
xlab = "peaks without motifs in quantiles",
par.settings = my.settings)
)
dev.off()
png('240119_fraction_of_peaks_have_closer_GAT.png')
my.settings <- list(
#superpose.polygon=list(col=c(colorRampPalette(c("red","pink"))(6),colorRampPalette(c("blue","light blue"))(3), "light grey"), border="transparent"),
superpose.polygon=list(col=c(colorRampPalette(c("pink","red"))(4), "#ce228e"), border="transparent"),
strip.background=list(col="grey80", cex = 0.6),
strip.border=list(col="black")
)
print(barchart(fraction ~ lib | negctrl,
data = bar,
groups = lib,
stack = TRUE,
ylim=c(0,0.25),
auto.key=list(space="right"),
scales = list(x = list(rot = 45)),
ylab = "fraction of peak set with closer GAT to summit",
xlab = "peaks without motifs in quantiles",
par.settings = my.settings)
)
dev.off()
Figure 10: fraction of peaks in each quantile with closer GAT compared to 4 different negative ctrls
library(lattice)
bar=read.csv(file ="barchart_sum.csv", header = T)
str(bar)
bar$lib=factor(bar$lib, levels=c("quantile1","quantile0.8","quantile0.6","quantile0.4","quantile0.2",))
pdf('240119_fraction_of_peaks_have_closer_GAT_2.pdf', width=8,height=6)
my.settings <- list(
#superpose.polygon=list(col=c(colorRampPalette(c("red","pink"))(6),colorRampPalette(c("blue","light blue"))(3), "light grey"), border="transparent"),
superpose.polygon=list(col=c(colorRampPalette(c("pink","red"))(4), "#ce228e"), border="transparent"),
strip.background=list(col="grey80", cex = 0.6),
strip.border=list(col="black")
)
print(barchart(fraction ~ negctrl,
data = bar,
groups = lib,
stack = TRUE,
ylim=c(0,0.50),
auto.key=list(space="right"),
scales = list(x = list(rot = 45)),
ylab = "fraction of peak set with closer GAT to summit",
xlab = "peaks without motifs in quantiles",
par.settings = my.settings)
)
dev.off()
png('240119_fraction_of_peaks_have_closer_GAT_2.png')
my.settings <- list(
#superpose.polygon=list(col=c(colorRampPalette(c("red","pink"))(6),colorRampPalette(c("blue","light blue"))(3), "light grey"), border="transparent"),
superpose.polygon=list(col=c(colorRampPalette(c("pink","red"))(4), "#ce228e"), border="transparent"),
strip.background=list(col="grey80", cex = 0.6),
strip.border=list(col="black")
)
print(barchart(fraction ~ negctrl,
data = bar,
groups = lib,
stack = TRUE,
ylim=c(0,0.50),
auto.key=list(space="right"),
scales = list(x = list(rot = 45)),
ylab = "fraction of peak set with closer GAT to summit",
xlab = "peaks without motifs in quantiles",
par.settings = my.settings)
)
dev.off()
Figure 11: fraction of peaks in each quantile with closer GAT compared to 4 different negative ctrls2
In the above analysis, we have used only 3mer GAT to do the analysis.
Automate the process that concatenate the minus and plus 3mer coordinates file together.
3mer coordinates on full hg38 use read size==1000.
# Specify the directory containing the bed files
dir="/labs/Guertin/siyu/Sathyan_GATA3_ChIP_pool1_pool2/overrep_3mer/hg38_full_kmer3_rs1000/seqdump/"
# Loop through each 3mer combination
for triplet in {A,T,C,G}{A,T,C,G}{A,T,C,G}
do
# Define the file names for plus and minus files
plus_file="${dir}hg38.3.3.3plus.*_${triplet}.bed"
minus_file="${dir}hg38.3.3.3minus*_${triplet}.bed"
# Concatenate the contents of plus and minus files to a new file
cat $plus_file $minus_file > hg38.3.3.3.1000_plus_minus_${triplet}.bed
echo "hg38.3.3.3.1000_plus_minus_${triplet}.bed"
done
Automate a parallel running process to find closest distance between different 3mer and our defined peak summit (positive and negative).
First create the R file (closestBed_3mer.R) to run the closestBed job:
#!/usr/bin/env Rscript
Args=commandArgs(TRUE)
library(bigWig)
bedTools.closest <- function(functionstring="/home/FCAM/ssun/packages/bedtools2/bin/closestBed",bed1,bed2,opt.string="") {
options(scipen =99) # not use scientific notation when writing out
#write bed formatted data.frames to tempfile
write.table(bed1,file= 'a.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(bed2,file= 'b.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
# create the command string and call the command using system()
# the command sort a and b file by coordinates
command1=paste('sort -k1,1 -k2,2n', 'a.file.bed', '> a.file.sorted.bed')
cat(command1,"\n") #sort -k1,1 -k2,2n a.file.bed > a.file.sorted.bed
try(system(command1))
command2=paste('sort -k1,1 -k2,2n', 'b.file.bed', '> b.file.sorted.bed')
cat(command2,"\n")
try(system(command2))
# the command call closestBed on bed1 and bed2
command=paste(functionstring, opt.string,"-a",'a.file.sorted.bed',"-b",'b.file.sorted.bed',">",'out.file.bed',sep=" ")
cat(command,"\n")
try(system(command))
res=read.table('out.file.bed',header=F, comment.char='')
# remove intermediate files
command3=paste('rm', 'a.file.bed', 'b.file.bed', 'a.file.sorted.bed', 'b.file.sorted.bed', 'out.file.bed')
cat(command3,"\n")
try(system(command3))
colnames(res) = c(colnames(bed1), colnames(bed2), 'dis' )
return(res)
}
# load 3mer bed
input_bed_file <- commandArgs(trailingOnly = TRUE)[1]
# Generate unique prefix based on the 3mer
output_prefix <- gsub(".*_(\\w+)\\.bed", "\\1", basename(input_bed_file))
print(output_prefix)
# load the 3mer coordinates file
trip.file=read.table(file = input_bed_file, sep="\t", header=FALSE)
head(trip.file)
tail(trip.file)
nrow(trip.file)
# pos control
pos.control.peak.with.all=data.frame()
for (mast.bed in Sys.glob(file.path("/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/Exhaustive_MEME_MAST_round1to8/", paste("with_motifs_", "*mast.bed", sep ='')))) {
temp=read.table(mast.bed, header=FALSE)[, 1:5] #only take peak regions
uniq.peak=temp[!duplicated(temp$V4), ] # make sure that the peak regions are unique
uniq.peak.center=center.bed(uniq.peak, upstreamWindow = 0, downstreamWindow = 0) #take the summit
pos.control.peak.with.all=rbind(pos.control.peak.with.all, uniq.peak.center)
}
closest.1st3mer.to.pos.control.motif.all=bedTools.closest(bed1 = pos.control.peak.with.all[,1:3], bed2 = trip.file, opt.string = '-d -t first')
write.table(closest.1st3mer.to.pos.control.motif.all,file= paste0('closest.1st',output_prefix,'.to.pos.control.motif.all.bed'), quote=F,sep="\t",col.names=F,row.names=F)
# consensus neg
full.DHS.control.consensus=center.bed(read.table("/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/GAT_3mer_analaysis/ENCODE_DHS_GSE29692/MCF7DHS_consensus_without_motifs_123456_78.bed", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
nrow(full.DHS.control.consensus)
head(full.DHS.control.consensus)
indep.DHS.control.consensus=center.bed(read.table("/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/GAT_3mer_analaysis/ENCODE_DHS_GSE29692/MCF7DHS_consensus_noGATA_without_motifs_123456_78.bed", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
nrow(indep.DHS.control.consensus)
head(indep.DHS.control.consensus)
closest.1st3mer.to.full.DHS.control.consensus=bedTools.closest(bed1 = full.DHS.control.consensus[,1:3], bed2 = trip.file, opt.string = '-d -t first')
closest.1st3mer.to.indep.DHS.control.consensus=bedTools.closest(bed1 = indep.DHS.control.consensus[,1:3], bed2 = trip.file, opt.string = '-d -t first')
write.table(closest.1st3mer.to.full.DHS.control.consensus,file= paste0('closest.1st',output_prefix,'.to.full.DHS.control.consensus.bed'), quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1st3mer.to.indep.DHS.control.consensus,file= paste0('closest.1st',output_prefix,'.to.indep.DHS.control.consensus.bed'), quote=F,sep="\t",col.names=F,row.names=F)
generate a “closestBed_3mer.sh”:
#!/bin/bash
#SBATCH --job-name=closestBed_3mer.sh # name for job
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 16
#SBATCH -p general
#SBATCH --qos=general
#SBATCH --mem=200G
#SBATCH --mail-type=ALL
#SBATCH --mail-user=ssun@uchc.edu
#SBATCH -o closestBed_3mer.sh_%j.out
#SBATCH -e closestBed_3mer.sh_%j.err
hostname
# placeholder
triplet='XXX'
module load R/4.1.2
Rscript ../closestBed_3mer.R "/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/other_3mer/hg38.3.3.3.1000_plus_minus_${triplet}.bed" >> /home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/other_3mer/ks_results_output.txt
echo "Job for ${triplet} completed."
Run the below chunk to submit multiple jobs in parallel:
#!/bin/bash
# Specify the directory containing the 3mer bed files
input_directory="/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/other_3mer"
# Specify the output directory
output_base_directory="/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/other_3mer"
# Loop through each 3mer combination
for triplet in {A,T,C,G}{A,T,C,G}{A,T,C,G}
do
# Create a distinct output directory for each 3mer
output_directory="${output_base_directory}/${triplet}"
mkdir -p "${output_directory}"
# Create a copy of the template script for each 3mer
cp closestBed_3mer.sh "${output_directory}/closestBed_${triplet}.sh"
# Replace the placeholder with the actual 3mer
sed -i "s/XXX/${triplet}/" "${output_directory}/closestBed_${triplet}.sh"
# Submit the job to SLURM
cd "${output_directory}"
sbatch "closestBed_${triplet}.sh"
cd ..
echo "Submitted job for ${triplet}."
done
KS TEST
CDF and PDFs
#module load R/4.1.2
#R
directory_path <- "/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/other_3mer"
output_file <- "/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/other_3mer/p_values.txt"
output_connection <- file(output_file, "w") # Open the output file for writing
folder_names <- apply(expand.grid(c("A", "T", "C", "G"), c("A", "T", "C", "G"), c("A", "T", "C", "G")), 1, paste, collapse = "")
#folder_names <- list.dirs(directory_path, full.names = FALSE) # Get a list of folder names in the directory
library(lattice)
library(latticeExtra)
# Iterate through each folder
for (folder_name in folder_names) {
# Set the path to the current folder
current_folder <- file.path(directory_path, folder_name)
# Read the bed files
neg.indep.dis <- read.table(file.path(current_folder, paste0("closest.1st",folder_name,".to.indep.DHS.control.consensus.bed")), header = FALSE)
neg.full.dis <- read.table(file.path(current_folder, paste0("closest.1st",folder_name,".to.full.DHS.control.consensus.bed")), header = FALSE)
pos.dis <- read.table(file.path(current_folder, paste0("closest.1st",folder_name,".to.pos.control.motif.all.bed")), header = FALSE)
# Perform the ks.test for positive control vs independent negative control
ks_test_result_bed1 <- ks.test(pos.dis[, 11], neg.indep.dis[, 11])
# Perform the ks.test for positive control vs full negative control
ks_test_result_bed2 <- ks.test(pos.dis[, 11], neg.full.dis[, 11])
# Write the p-values to the output file
cat(sprintf("%s\t%s\t%s\n", folder_name, ks_test_result_bed1$p.value, ks_test_result_bed2$p.value), file = output_connection)
# Generate PDF and CDF plots for the distance info in the three bed files
df.plot=as.data.frame(rbind(cbind(neg.indep.dis[, 11], "neg.indep"), cbind(neg.full.dis[, 11], "neg.full"), cbind(pos.dis[, 11], "pos")))
colnames(df.plot)=c("dis","status")
df.plot$status = factor(df.plot$status, levels = c("pos", "neg.indep", "neg.full"))
df.plot$dis=as.numeric(df.plot$dis)
# PDF plot
pdf(file.path(current_folder, paste0("PDF_plot_1st_3mer_", folder_name, "_distance_to_summits.pdf")))
print(densityplot( ~ abs(dis),
groups = status,
data = df.plot,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c("red", "blue4","blue"),
aspect = 1,
xlim=c(0,50),
from=0,
to=50,
#layout=c(1,2),
xlab = paste0("distance (bp) from 1st 3mer ", folder_name, " to summit"),
main = "GATA3 peaks (red) vs. DHS regions (blue)",
between=list(y=1.0),
type = "count",
lty = c(1),
lwd=2,
par.settings = list(superpose.line = list(col=c("red", "blue4","blue"), lwd=3), strip.background=list(col="grey85")),
panel = function(...) {
panel.abline(v= 0, lty =2, col="red")
panel.abline(v= 16, lty =2,col="red")
panel.densityplot(...)
}
))
dev.off()
# CDF plot
pdf(file.path(current_folder, paste0("CDF_plot_1st_3mer_", folder_name, "_distance_to_summits.pdf")))
print(ecdfplot(~log(abs(dis), base = 10), groups = status, data = df.plot,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c("red", "blue4","blue"),
aspect = 1,
#xlim = c(0, 50000),
#scales=list(relation="free",alternating=c(1,1,1,1)),
main=paste0("3mer: ", folder_name),
ylab = 'Cumulative Distribution Function',
xlab = expression('log'[10]~ '3mer Distance from peak summit'),
#index.cond = list(c(2,1)),
between=list(y=1.0),
type = 'a',
xlim = c(0,2.5),
lwd=2,
lty=c(1),
par.settings = list(superpose.line = list(col=c("red", "blue4","blue"), lwd=3), strip.background=list(col="grey85")),
panel = function(...) {
panel.abline(v= 1.204, lty =2, col="red") #log10(16)=1.204
panel.ecdfplot(...)
}
))
dev.off()
}
# Close the output file
close(output_connection)
The above PDF and CDF plots provide a visual inspection of the closest 3mer’s distribution for each GATA3 peak/DHS regions summit.
The densityplot() uses kernel density probability estimate to calculate the density probability of numeric variables.
To quantitatively compare the differences of 3mer enrichment near summit regions within range (0,16)(bp) between GATA3 peaks and DHS regions, we can calculate the differences of AUC within the range we selected.
#module load R/4.1.2
#R
directory_path <- "/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/other_3mer"
output_file <- "/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/other_3mer/AUC_values.txt"
output_connection <- file(output_file, "w") # Open the output file for writing
folder_names <- apply(expand.grid(c("A", "T", "C", "G"), c("A", "T", "C", "G"), c("A", "T", "C", "G")), 1, paste, collapse = "")
#folder_names <- list.dirs(directory_path, full.names = FALSE) # Get a list of folder names in the directory
#define a function to calculate AUC
library(lattice)
calculate_area_under_curve <- function(df, status, range_start, range_end) {
# Calculate density values using the density() function
density_values <- density(abs(df[df$status==status,]$dis), from=0, to=50)
# Find indices corresponding to the specified range
indices <- density_values$x >= range_start & density_values$x <= range_end
# Subset density values within the specified range
density_values_within_range <- density_values$y[indices]
# Calculate the area under the curve
area_under_curve <- sum(density_values_within_range) * diff(density_values$x)[1]
return(area_under_curve)
}
# Iterate through each folder
for (folder_name in folder_names) {
# Set the path to the current folder
current_folder <- file.path(directory_path, folder_name)
# Read the bed files
neg.indep.dis <- read.table(file.path(current_folder, paste0("closest.1st",folder_name,".to.indep.DHS.control.consensus.bed")), header = FALSE)
neg.full.dis <- read.table(file.path(current_folder, paste0("closest.1st",folder_name,".to.full.DHS.control.consensus.bed")), header = FALSE)
pos.dis <- read.table(file.path(current_folder, paste0("closest.1st",folder_name,".to.pos.control.motif.all.bed")), header = FALSE)
# combine data for convenience
df.plot=as.data.frame(rbind(cbind(neg.indep.dis[, 11], "neg.indep"), cbind(neg.full.dis[, 11], "neg.full"), cbind(pos.dis[, 11], "pos")))
colnames(df.plot)=c("dis","status")
df.plot$status = factor(df.plot$status, levels = c("pos", "neg.indep", "neg.full"))
df.plot$dis=as.numeric(df.plot$dis)
#get the AUC for each status
pos.AUC=calculate_area_under_curve(df.plot, "pos", 0, 16)
neg.full.AUC=calculate_area_under_curve(df.plot, "neg.full", 0, 16)
neg.indep.AUC=calculate_area_under_curve(df.plot, "neg.indep", 0, 16)
#get the average difference between pos.AUC vs. neg.AUC
auc_difference <- ((pos.AUC - neg.full.AUC)+(pos.AUC - neg.indep.AUC))/2
# Print the auc_difference value as needed
cat(sprintf("%s\t%s\t%s\t%s\t%s\n", folder_name, pos.AUC, neg.full.AUC, neg.indep.AUC, auc_difference), file = output_connection)
}
# Close the output file
close(output_connection)
Rank the difference of AUC from largest difference to smallest difference.
AUC_df <- read.table("AUC_values.txt", header = FALSE, sep = "\t")
colnames(AUC_df)=c("triplet", "prox_enrich_16bp_pos", "prox_enrich_16bp_neg_full", "prox_enrich_16bp_neg_indep", "prox_enrich_16bp_ave_difference")
# Sort the data frame based on the fifth column in descending order
sorted_AUC_df <- AUC_df[order(-AUC_df$prox_enrich_16bp_ave_difference), ]
sorted_AUC_df
## triplet prox_enrich_16bp_pos prox_enrich_16bp_neg_full
## 5 ATA 0.67575636 0.2611054
## 18 TAT 0.67555116 0.2611610
## 2 TAA 0.71549728 0.3570873
## 6 TTA 0.71526187 0.3570808
## 37 ATC 0.71384355 0.3795742
## 20 GAT 0.71378404 0.3795710
## 17 AAT 0.70971268 0.4378935
## 21 ATT 0.70955517 0.4379173
## 22 TTT 0.72517700 0.5156224
## 1 AAA 0.72519652 0.5157115
## 13 AGA 0.77023909 0.6252357
## 26 TCT 0.77013636 0.6253057
## 7 CTA 0.51598650 0.3817886
## 50 TAG 0.51601000 0.3818405
## 53 ATG 0.62996153 0.4983060
## 19 CAT 0.62999918 0.4984317
## 8 GTA 0.45579672 0.3388961
## 14 TGA 0.74202518 0.6258732
## 10 TCA 0.74202329 0.6259337
## 34 TAC 0.45563805 0.3388954
## 23 CTT 0.71614282 0.6318276
## 49 AAG 0.71614901 0.6318533
## 3 CAA 0.68010829 0.6035399
## 54 TTG 0.68019167 0.6036621
## 33 AAC 0.59547634 0.5291042
## 24 GTT 0.59542905 0.5292581
## 9 ACA 0.66148432 0.6148641
## 30 TGT 0.66145053 0.6149111
## 4 GAA 0.64464136 0.6148243
## 38 TTC 0.64451367 0.6147675
## 25 ACT 0.60963273 0.5961559
## 29 AGT 0.60976449 0.5981685
## 58 TCG 0.06543796 0.1449888
## 15 CGA 0.06545949 0.1450008
## 55 CTG 0.69015252 0.7778653
## 51 CAG 0.69014408 0.7780839
## 31 CGT 0.07588961 0.1818587
## 57 ACG 0.07591769 0.1819334
## 39 CTC 0.54843845 0.6717089
## 52 GAG 0.54859546 0.6718449
## 40 GTC 0.39205548 0.5315965
## 36 GAC 0.39199995 0.5316651
## 47 CGC 0.04676302 0.1895542
## 60 GCG 0.04675334 0.1896063
## 32 GGT 0.42969908 0.5720952
## 41 ACC 0.42961747 0.5721292
## 12 GCA 0.48273206 0.6312974
## 46 TGC 0.48270476 0.6312630
## 59 CCG 0.05668801 0.2121436
## 63 CGG 0.05668766 0.2128744
## 35 CAC 0.49126410 0.6412181
## 56 GTG 0.49134502 0.6413699
## 42 TCC 0.50716206 0.6622496
## 16 GGA 0.50723344 0.6624667
## 11 CCA 0.57162994 0.7252596
## 62 TGG 0.57159081 0.7252513
## 27 CCT 0.55658902 0.7336102
## 61 AGG 0.55659318 0.7338365
## 28 GCT 0.45322427 0.6311430
## 45 AGC 0.45324768 0.6311876
## 43 CCC 0.30513374 0.5576578
## 64 GGG 0.30518660 0.5577910
## 44 GCC 0.34964604 0.6285338
## 48 GGC 0.34964278 0.6286900
## prox_enrich_16bp_neg_indep prox_enrich_16bp_ave_difference
## 5 0.2480388 0.42118427
## 18 0.2481019 0.42091973
## 2 0.3425375 0.36568493
## 6 0.3425197 0.36546159
## 37 0.3747919 0.33666050
## 20 0.3747534 0.33662187
## 17 0.4247085 0.27841170
## 21 0.4247522 0.27822042
## 22 0.4982319 0.21824983
## 1 0.4983291 0.21817623
## 13 0.6203020 0.14747022
## 26 0.6203894 0.14728883
## 7 0.3813805 0.13440198
## 50 0.3814302 0.13437466
## 53 0.4944738 0.13357160
## 19 0.4946709 0.13344787
## 8 0.3337786 0.11945938
## 14 0.6193319 0.11942262
## 10 0.6193673 0.11937276
## 34 0.3337733 0.11930370
## 23 0.6246091 0.08792450
## 49 0.6246415 0.08790162
## 3 0.5963776 0.08014951
## 54 0.5965378 0.08009172
## 33 0.5182054 0.07182152
## 24 0.5198085 0.07089579
## 9 0.6138030 0.04715076
## 30 0.6138467 0.04707159
## 4 0.6035559 0.03545126
## 38 0.6034564 0.03540172
## 25 0.5915971 0.01575625
## 29 0.5920780 0.01464126
## 58 0.1364455 -0.07527919
## 15 0.1364792 -0.07528055
## 55 0.7777854 -0.08767287
## 51 0.7780507 -0.08792322
## 31 0.1748743 -0.10247691
## 57 0.1754903 -0.10279414
## 39 0.6707086 -0.12277030
## 52 0.6709800 -0.12281699
## 40 0.5274546 -0.13747004
## 36 0.5274876 -0.13757641
## 47 0.1818172 -0.13892264
## 60 0.1819010 -0.13900034
## 32 0.5744897 -0.14359336
## 41 0.5745524 -0.14372335
## 12 0.6316599 -0.14874660
## 46 0.6316565 -0.14875501
## 59 0.2034568 -0.15111220
## 63 0.2035609 -0.15152999
## 35 0.6461654 -0.15242765
## 56 0.6463493 -0.15251454
## 42 0.6609630 -0.15444422
## 16 0.6612527 -0.15462623
## 11 0.7304133 -0.15620648
## 62 0.7304352 -0.15625242
## 27 0.7354424 -0.17793728
## 61 0.7357602 -0.17820517
## 28 0.6340127 -0.17935362
## 45 0.6340463 -0.17936927
## 43 0.5624881 -0.25493921
## 64 0.5627153 -0.25506652
## 44 0.6305226 -0.27988216
## 48 0.6307873 -0.28009586
calculate the fraction of peaks has closer 3mer by CDF.
#module load R/4.1.2
#R
directory_path <- "/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/other_3mer"
output_file <- "/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/other_3mer/CDF_fractions.txt"
output_connection <- file(output_file, "w") # Open the output file for writing
folder_names <- apply(expand.grid(c("A", "T", "C", "G"), c("A", "T", "C", "G"), c("A", "T", "C", "G")), 1, paste, collapse = "")
#folder_names <- list.dirs(directory_path, full.names = FALSE) # Get a list of folder names in the directory
#define a function to calculate AUC
library(lattice)
# Iterate through each folder
for (folder_name in folder_names) {
# Set the path to the current folder
current_folder <- file.path(directory_path, folder_name)
# Read the bed files
neg.indep.dis <- read.table(file.path(current_folder, paste0("closest.1st",folder_name,".to.indep.DHS.control.consensus.bed")), header = FALSE)
neg.full.dis <- read.table(file.path(current_folder, paste0("closest.1st",folder_name,".to.full.DHS.control.consensus.bed")), header = FALSE)
pos.dis <- read.table(file.path(current_folder, paste0("closest.1st",folder_name,".to.pos.control.motif.all.bed")), header = FALSE)
# combine data for convenience
df.plot=as.data.frame(rbind(cbind(neg.indep.dis[, 11], "neg.indep"), cbind(neg.full.dis[, 11], "neg.full"), cbind(pos.dis[, 11], "pos")))
colnames(df.plot)=c("dis","status")
df.plot$status = factor(df.plot$status, levels = c("pos", "neg.indep", "neg.full"))
df.plot$dis=as.numeric(df.plot$dis)
#get the fraction of regions with closer 3mer for each status
pos.peak=ecdf(df.plot[df.plot$status=="pos",]$dis)(16)
neg.full.peak=ecdf(df.plot[df.plot$status=="neg.full",]$dis)(16)
neg.indep.peak=ecdf(df.plot[df.plot$status=="neg.indep",]$dis)(16)
#get the average difference between fraction of pos.peak vs. neg.peak
fraction_difference <- ((pos.peak - neg.full.peak)+(pos.peak - neg.indep.peak))/2
# Print the auc_difference value as needed
cat(sprintf("%s\t%s\t%s\t%s\t%s\n", folder_name, pos.peak, neg.full.peak, neg.indep.peak, fraction_difference), file = output_connection)
}
# Close the output file
close(output_connection)
Rank the fraction of peaks with closer 3mer from largest difference to smallest difference.
CDF_df <- read.table("CDF_fractions.txt", header = FALSE, sep = "\t")
colnames(CDF_df)=c("triplet", "prox_enrich_16bp_pos", "prox_enrich_16bp_neg_full", "prox_enrich_16bp_neg_indep", "prox_enrich_16bp_ave_difference")
# Sort the data frame based on the fifth column in descending order
sorted_CDF_df <- CDF_df[order(-CDF_df$prox_enrich_16bp_ave_difference), ]
sorted_CDF_df
## triplet prox_enrich_16bp_pos prox_enrich_16bp_neg_full
## 5 ATA 0.76042646 0.3272464
## 18 TAT 0.76019140 0.3273733
## 2 TAA 0.80060443 0.4243450
## 6 TTA 0.80033580 0.4243704
## 37 ATC 0.80638012 0.4366963
## 20 GAT 0.80636333 0.4367217
## 17 AAT 0.79555071 0.5094347
## 21 ATT 0.79536602 0.5094474
## 22 TTT 0.81966085 0.5949555
## 1 AAA 0.81971122 0.5950950
## 13 AGA 0.86746138 0.6993076
## 26 TCT 0.86731028 0.6993964
## 50 TAG 0.58890195 0.4434934
## 7 CTA 0.58886837 0.4434680
## 53 ATG 0.70525520 0.5679576
## 19 CAT 0.70532236 0.5680844
## 14 TGA 0.82622565 0.7005123
## 10 TCA 0.82620887 0.7005884
## 8 GTA 0.51878778 0.3989957
## 34 TAC 0.51861988 0.3989703
## 23 CTT 0.79578576 0.7038347
## 49 AAG 0.79580255 0.7039108
## 3 CAA 0.75475151 0.6764057
## 54 TTG 0.75485225 0.6765705
## 33 AAC 0.66371726 0.5979609
## 24 GTT 0.66363331 0.5981131
## 9 ACA 0.73999328 0.6925994
## 30 TGT 0.73994291 0.6926754
## 4 GAA 0.71727670 0.6874382
## 38 TTC 0.71714238 0.6874255
## 25 ACT 0.67978509 0.6673007
## 29 AGT 0.67990262 0.6676558
## 55 CTG 0.77159167 0.8694616
## 51 CAG 0.77162525 0.8696264
## 15 CGA 0.10862995 0.2168074
## 58 TCG 0.10857958 0.2167821
## 31 CGT 0.11626931 0.2479647
## 57 ACG 0.11628610 0.2480408
## 39 CTC 0.61878778 0.7521368
## 52 GAG 0.61893889 0.7522509
## 40 GTC 0.44974815 0.5965660
## 36 GAC 0.44971457 0.5966040
## 32 GGT 0.49069846 0.6398590
## 41 ACC 0.49064809 0.6398970
## 12 GCA 0.54852250 0.7066880
## 46 TGC 0.54845534 0.7067260
## 35 CAC 0.55775688 0.7167947
## 56 GTG 0.55784083 0.7169342
## 42 TCC 0.57372398 0.7386188
## 16 GGA 0.57385829 0.7389104
## 62 TGG 0.64152116 0.8078952
## 11 CCA 0.64153794 0.8079206
## 45 AGC 0.51714238 0.7057369
## 28 GCT 0.51712559 0.7057496
## 27 CCT 0.62694762 0.8179386
## 61 AGG 0.62696441 0.8181668
## 59 CCG 0.09969778 0.3184078
## 63 CGG 0.09969778 0.3184585
## 47 CGC 0.08720618 0.3068046
## 60 GCG 0.08722297 0.3068934
## 43 CCC 0.36433848 0.6370692
## 64 GGG 0.36442243 0.6372087
## 44 GCC 0.41022498 0.7084126
## 48 GGC 0.41019140 0.7085648
## prox_enrich_16bp_neg_indep prox_enrich_16bp_ave_difference
## 5 0.3148033 0.43940159
## 18 0.3149587 0.43902541
## 2 0.4113391 0.38276239
## 6 0.4113563 0.38247243
## 37 0.4341001 0.37098190
## 20 0.4341001 0.37095243
## 17 0.4958726 0.29289705
## 21 0.4959244 0.29268012
## 22 0.5783166 0.23302481
## 1 0.5784720 0.23292772
## 13 0.6979070 0.16885410
## 26 0.6980106 0.16860680
## 50 0.4453597 0.14447540
## 7 0.4453425 0.14446314
## 53 0.5658826 0.13833509
## 19 0.5660899 0.13823523
## 14 0.6968363 0.12755137
## 10 0.6968708 0.12747927
## 8 0.3956930 0.12144344
## 34 0.3956412 0.12131412
## 23 0.6981487 0.09479403
## 49 0.6982523 0.09472097
## 3 0.6703450 0.08137614
## 54 0.6705523 0.08129084
## 33 0.5894035 0.07003506
## 24 0.5895071 0.06982321
## 9 0.6941077 0.04663976
## 30 0.6941940 0.04650817
## 4 0.6788588 0.03412819
## 38 0.6788070 0.03402611
## 25 0.6640590 0.01410524
## 29 0.6645598 0.01379482
## 55 0.8717922 -0.09903522
## 51 0.8719822 -0.09917905
## 15 0.2091838 -0.10436568
## 58 0.2091148 -0.10436883
## 31 0.2435154 -0.12947072
## 57 0.2436190 -0.12954378
## 39 0.7536179 -0.13408957
## 52 0.7538597 -0.13411641
## 40 0.5942217 -0.14564567
## 36 0.5942217 -0.14569827
## 32 0.6448382 -0.15165013
## 41 0.6448900 -0.15174543
## 12 0.7096329 -0.15963791
## 46 0.7097537 -0.15978454
## 35 0.7245881 -0.16293455
## 56 0.7247781 -0.16301533
## 42 0.7416330 -0.16640191
## 16 0.7420302 -0.16661202
## 62 0.8135599 -0.16920640
## 11 0.8135772 -0.16921093
## 45 0.7107381 -0.19109512
## 28 0.7107726 -0.19113552
## 27 0.8229026 -0.19347299
## 61 0.8232308 -0.19373439
## 59 0.3089317 -0.21397196
## 63 0.3090526 -0.21405777
## 47 0.2989500 -0.21567115
## 60 0.2991054 -0.21577646
## 43 0.6451663 -0.27677925
## 64 0.6454081 -0.27688593
## 44 0.7148482 -0.30140541
## 48 0.7150900 -0.30163596
scatter plot
The intersection point of CDF is essentially same with the AUC of PDF at distance(x)==16bp.
The scatter plot of the y intersection of CDF at x==16bp and the AUC of PDF at x==16bp is on the diagonal.
library(lattice)
xyplot(AUC_df$prox_enrich_16bp_ave_difference ~CDF_df$prox_enrich_16bp_ave_difference,
asp=1,
xlim=c(-0.5,0.5),
ylim=c(-0.5,0.5))
determine categories for 3mer enrichment in proximity to GATA3 peaks
We’ve compiled a list of 3-mers ranked by their proximity to GATA3 peaks. Our next step involves clustering these 3-mers based on their enrichment levels. To gain insights, we’ll use visual plots depicting the distribution of 3-mer enrichment.
Since in the scatter plot above, it is very similar for using AUC from PDF plots or y intersection from CDF plots at the defined distance. From now on, I will only use the y values from CDF plots at defined distance to determine “proximal enrichment”.
CDF_df <- read.table("CDF_fractions.txt", header = FALSE, sep = "\t")
colnames(CDF_df)=c("triplet", "prox_enrich_16bp_pos", "prox_enrich_16bp_neg_full", "prox_enrich_16bp_neg_indep", "prox_enrich_16bp_ave_difference")
head(CDF_df)
## triplet prox_enrich_16bp_pos prox_enrich_16bp_neg_full
## 1 AAA 0.8197112 0.5950950
## 2 TAA 0.8006044 0.4243450
## 3 CAA 0.7547515 0.6764057
## 4 GAA 0.7172767 0.6874382
## 5 ATA 0.7604265 0.3272464
## 6 TTA 0.8003358 0.4243704
## prox_enrich_16bp_neg_indep prox_enrich_16bp_ave_difference
## 1 0.5784720 0.23292772
## 2 0.4113391 0.38276239
## 3 0.6703450 0.08137614
## 4 0.6788588 0.03412819
## 5 0.3148033 0.43940159
## 6 0.4113563 0.38247243
Barchart of GATA3 peak
CDF_df_rank_pos=CDF_df[order(-CDF_df$prox_enrich_16bp_ave_difference), ][,c(1,5)]
# Reorder levels of the factor variable based on fraction
CDF_df_rank_pos$triplet<- factor(CDF_df_rank_pos$triplet,
levels = CDF_df_rank_pos$triplet[order(CDF_df_rank_pos$prox_enrich_16bp_ave_difference, decreasing = TRUE)])
head(CDF_df_rank_pos)
## triplet prox_enrich_16bp_ave_difference
## 5 ATA 0.4394016
## 18 TAT 0.4390254
## 2 TAA 0.3827624
## 6 TTA 0.3824724
## 37 ATC 0.3709819
## 20 GAT 0.3709524
my.settings <- list(
superpose.polygon=list(col=c("black", "grey"), border="transparent"),
strip.background=list(col="grey80", cex = 0.6),
strip.border=list(col="black")
)
barchart(prox_enrich_16bp_ave_difference ~ triplet,
#groups=triplet,
horizontal = FALSE,
data=CDF_df_rank_pos,
#type="count",
#layout = c(1, length(unique(CDF_df_rank_pos$triplet))),
#ylim=c(-0,0.5),
col=ifelse(CDF_df_rank_pos$prox_enrich_16bp_ave_difference >= 0, "skyblue", "grey"),
origin = 0,
scales = list(x = list(cex = 0.5, rot = 45)),
xlab="3mer",
ylab="fraction of proximal emrichment",
main="3mer Proximal Emrichment to GATA3 peaks at 16bp",
cex.axis=0.5,
between=list(y=0.5, x=0.5),
font.axis=1,
par.settings=my.settings,
panel=function(...) {
panel.barchart(...)
panel.abline(h=0.20, lty = "dotted", col = "red", identifier = "abline")
panel.abline(h=0, lty = "dotted", col = "red", identifier = "abline")
}
)
I want to put cut-offs at fration==0.20, and fraction==0; The 3mers that has a proximal enrichment larger than 0.15 will be defined as “high enrichment”, between 0 to 0.15 is “low enrichment”, fraction below 0 is “no enrichment”.
test robustness of our categorization
The fraction we plot in the above plots are using distance at 16bp to define 3mer enrichment “near” GATA3 peaks. We will now plotting using distance at 20bp, 30bp, and 50bp to see if we can get robust results for clustering.
Calculate the fraction of peaks has closer 3mer by CDF.
At distance==20bp:
#module load R/4.1.2
#R
directory_path <- "/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/other_3mer"
output_file <- "/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/other_3mer/CDF_fractions_20bp.txt"
output_connection <- file(output_file, "w") # Open the output file for writing
folder_names <- apply(expand.grid(c("A", "T", "C", "G"), c("A", "T", "C", "G"), c("A", "T", "C", "G")), 1, paste, collapse = "")
#folder_names <- list.dirs(directory_path, full.names = FALSE) # Get a list of folder names in the directory
#define a function to calculate AUC
library(lattice)
# Iterate through each folder
for (folder_name in folder_names) {
# Set the path to the current folder
current_folder <- file.path(directory_path, folder_name)
# Read the bed files
neg.indep.dis <- read.table(file.path(current_folder, paste0("closest.1st",folder_name,".to.indep.DHS.control.consensus.bed")), header = FALSE)
neg.full.dis <- read.table(file.path(current_folder, paste0("closest.1st",folder_name,".to.full.DHS.control.consensus.bed")), header = FALSE)
pos.dis <- read.table(file.path(current_folder, paste0("closest.1st",folder_name,".to.pos.control.motif.all.bed")), header = FALSE)
# combine data for convenience
df.plot=as.data.frame(rbind(cbind(neg.indep.dis[, 11], "neg.indep"), cbind(neg.full.dis[, 11], "neg.full"), cbind(pos.dis[, 11], "pos")))
colnames(df.plot)=c("dis","status")
df.plot$status = factor(df.plot$status, levels = c("pos", "neg.indep", "neg.full"))
df.plot$dis=as.numeric(df.plot$dis)
#get the fraction of regions with closer 3mer for each status
pos.peak=ecdf(df.plot[df.plot$status=="pos",]$dis)(20)
neg.full.peak=ecdf(df.plot[df.plot$status=="neg.full",]$dis)(20)
neg.indep.peak=ecdf(df.plot[df.plot$status=="neg.indep",]$dis)(20)
#get the average difference between fraction of pos.peak vs. neg.peak
fraction_difference <- ((pos.peak - neg.full.peak)+(pos.peak - neg.indep.peak))/2
# Print the CDF_fraction_difference value as needed
cat(sprintf("%s\t%s\t%s\t%s\t%s\n", folder_name, pos.peak, neg.full.peak, neg.indep.peak, fraction_difference), file = output_connection)
}
# Close the output file
close(output_connection)
At distance==30bp:
#module load R/4.1.2
#R
directory_path <- "/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/other_3mer"
output_file <- "/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/other_3mer/CDF_fractions_30bp.txt"
output_connection <- file(output_file, "w") # Open the output file for writing
folder_names <- apply(expand.grid(c("A", "T", "C", "G"), c("A", "T", "C", "G"), c("A", "T", "C", "G")), 1, paste, collapse = "")
#folder_names <- list.dirs(directory_path, full.names = FALSE) # Get a list of folder names in the directory
#define a function to calculate AUC
library(lattice)
# Iterate through each folder
for (folder_name in folder_names) {
# Set the path to the current folder
current_folder <- file.path(directory_path, folder_name)
# Read the bed files
neg.indep.dis <- read.table(file.path(current_folder, paste0("closest.1st",folder_name,".to.indep.DHS.control.consensus.bed")), header = FALSE)
neg.full.dis <- read.table(file.path(current_folder, paste0("closest.1st",folder_name,".to.full.DHS.control.consensus.bed")), header = FALSE)
pos.dis <- read.table(file.path(current_folder, paste0("closest.1st",folder_name,".to.pos.control.motif.all.bed")), header = FALSE)
# combine data for convenience
df.plot=as.data.frame(rbind(cbind(neg.indep.dis[, 11], "neg.indep"), cbind(neg.full.dis[, 11], "neg.full"), cbind(pos.dis[, 11], "pos")))
colnames(df.plot)=c("dis","status")
df.plot$status = factor(df.plot$status, levels = c("pos", "neg.indep", "neg.full"))
df.plot$dis=as.numeric(df.plot$dis)
#get the fraction of regions with closer 3mer for each status
pos.peak=ecdf(df.plot[df.plot$status=="pos",]$dis)(30)
neg.full.peak=ecdf(df.plot[df.plot$status=="neg.full",]$dis)(30)
neg.indep.peak=ecdf(df.plot[df.plot$status=="neg.indep",]$dis)(30)
#get the average difference between fraction of pos.peak vs. neg.peak
fraction_difference <- ((pos.peak - neg.full.peak)+(pos.peak - neg.indep.peak))/2
# Print the CDF_fraction_difference value as needed
cat(sprintf("%s\t%s\t%s\t%s\t%s\n", folder_name, pos.peak, neg.full.peak, neg.indep.peak, fraction_difference), file = output_connection)
}
# Close the output file
close(output_connection)
At distance==40bp:
#module load R/4.1.2
#R
directory_path <- "/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/other_3mer"
output_file <- "/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/other_3mer/CDF_fractions_40bp.txt"
output_connection <- file(output_file, "w") # Open the output file for writing
folder_names <- apply(expand.grid(c("A", "T", "C", "G"), c("A", "T", "C", "G"), c("A", "T", "C", "G")), 1, paste, collapse = "")
#folder_names <- list.dirs(directory_path, full.names = FALSE) # Get a list of folder names in the directory
#define a function to calculate AUC
library(lattice)
# Iterate through each folder
for (folder_name in folder_names) {
# Set the path to the current folder
current_folder <- file.path(directory_path, folder_name)
# Read the bed files
neg.indep.dis <- read.table(file.path(current_folder, paste0("closest.1st",folder_name,".to.indep.DHS.control.consensus.bed")), header = FALSE)
neg.full.dis <- read.table(file.path(current_folder, paste0("closest.1st",folder_name,".to.full.DHS.control.consensus.bed")), header = FALSE)
pos.dis <- read.table(file.path(current_folder, paste0("closest.1st",folder_name,".to.pos.control.motif.all.bed")), header = FALSE)
# combine data for convenience
df.plot=as.data.frame(rbind(cbind(neg.indep.dis[, 11], "neg.indep"), cbind(neg.full.dis[, 11], "neg.full"), cbind(pos.dis[, 11], "pos")))
colnames(df.plot)=c("dis","status")
df.plot$status = factor(df.plot$status, levels = c("pos", "neg.indep", "neg.full"))
df.plot$dis=as.numeric(df.plot$dis)
#get the fraction of regions with closer 3mer for each status
pos.peak=ecdf(df.plot[df.plot$status=="pos",]$dis)(40)
neg.full.peak=ecdf(df.plot[df.plot$status=="neg.full",]$dis)(40)
neg.indep.peak=ecdf(df.plot[df.plot$status=="neg.indep",]$dis)(40)
#get the average difference between fraction of pos.peak vs. neg.peak
fraction_difference <- ((pos.peak - neg.full.peak)+(pos.peak - neg.indep.peak))/2
# Print the CDF_fraction_difference value as needed
cat(sprintf("%s\t%s\t%s\t%s\t%s\n", folder_name, pos.peak, neg.full.peak, neg.indep.peak, fraction_difference), file = output_connection)
}
# Close the output file
close(output_connection)
At distance==50bp:
#module load R/4.1.2
#R
directory_path <- "/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/other_3mer"
output_file <- "/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/other_3mer/CDF_fractions_50bp.txt"
output_connection <- file(output_file, "w") # Open the output file for writing
folder_names <- apply(expand.grid(c("A", "T", "C", "G"), c("A", "T", "C", "G"), c("A", "T", "C", "G")), 1, paste, collapse = "")
#folder_names <- list.dirs(directory_path, full.names = FALSE) # Get a list of folder names in the directory
#define a function to calculate AUC
library(lattice)
# Iterate through each folder
for (folder_name in folder_names) {
# Set the path to the current folder
current_folder <- file.path(directory_path, folder_name)
# Read the bed files
neg.indep.dis <- read.table(file.path(current_folder, paste0("closest.1st",folder_name,".to.indep.DHS.control.consensus.bed")), header = FALSE)
neg.full.dis <- read.table(file.path(current_folder, paste0("closest.1st",folder_name,".to.full.DHS.control.consensus.bed")), header = FALSE)
pos.dis <- read.table(file.path(current_folder, paste0("closest.1st",folder_name,".to.pos.control.motif.all.bed")), header = FALSE)
# combine data for convenience
df.plot=as.data.frame(rbind(cbind(neg.indep.dis[, 11], "neg.indep"), cbind(neg.full.dis[, 11], "neg.full"), cbind(pos.dis[, 11], "pos")))
colnames(df.plot)=c("dis","status")
df.plot$status = factor(df.plot$status, levels = c("pos", "neg.indep", "neg.full"))
df.plot$dis=as.numeric(df.plot$dis)
#get the fraction of regions with closer 3mer for each status
pos.peak=ecdf(df.plot[df.plot$status=="pos",]$dis)(50)
neg.full.peak=ecdf(df.plot[df.plot$status=="neg.full",]$dis)(50)
neg.indep.peak=ecdf(df.plot[df.plot$status=="neg.indep",]$dis)(50)
#get the average difference between fraction of pos.peak vs. neg.peak
fraction_difference <- ((pos.peak - neg.full.peak)+(pos.peak - neg.indep.peak))/2
# Print the CDF_fraction_difference value as needed
cat(sprintf("%s\t%s\t%s\t%s\t%s\n", folder_name, pos.peak, neg.full.peak, neg.indep.peak, fraction_difference), file = output_connection)
}
# Close the output file
close(output_connection)
load data
CDF_df_16bp <- cbind(read.table("CDF_fractions.txt", header = FALSE, sep = "\t"), "16bp")
CDF_df_20bp <- cbind(read.table("CDF_fractions_20bp.txt", header = FALSE, sep = "\t"),"20bp")
CDF_df_30bp <- cbind(read.table("CDF_fractions_30bp.txt", header = FALSE, sep = "\t"),"30bp")
CDF_df_40bp <- cbind(read.table("CDF_fractions_40bp.txt", header = FALSE, sep = "\t"),"40bp")
CDF_df_50bp <- cbind(read.table("CDF_fractions_50bp.txt", header = FALSE, sep = "\t"),"50bp")
colnames(CDF_df_16bp)=c("triplet", "prox_enrich_pos", "prox_enrich_neg_full", "prox_enrich_neg_indep", "prox_enrich_ave_difference", "dis")
colnames(CDF_df_20bp)=c("triplet", "prox_enrich_pos", "prox_enrich_neg_full", "prox_enrich_neg_indep", "prox_enrich_ave_difference", "dis")
colnames(CDF_df_30bp)=c("triplet", "prox_enrich_pos", "prox_enrich_neg_full", "prox_enrich_neg_indep", "prox_enrich_ave_difference", "dis")
colnames(CDF_df_40bp)=c("triplet", "prox_enrich_pos", "prox_enrich_neg_full", "prox_enrich_neg_indep", "prox_enrich_ave_difference", "dis")
colnames(CDF_df_50bp)=c("triplet", "prox_enrich_pos", "prox_enrich_neg_full", "prox_enrich_neg_indep", "prox_enrich_ave_difference", "dis")
head(CDF_df_20bp)
## triplet prox_enrich_pos prox_enrich_neg_full prox_enrich_neg_indep
## 1 AAA 0.8716420 0.6567374 0.6416952
## 2 TAA 0.8549362 0.4848715 0.4716782
## 3 CAA 0.8192747 0.7440336 0.7398370
## 4 GAA 0.7862995 0.7542925 0.7453459
## 5 ATA 0.8224312 0.3798473 0.3664042
## 6 TTA 0.8546340 0.4848842 0.4716955
## prox_enrich_ave_difference dis
## 1 0.22242575 20bp
## 2 0.37666131 20bp
## 3 0.07733940 20bp
## 4 0.03648031 20bp
## 5 0.44930542 20bp
## 6 0.37634412 20bp
Barchart of GATA3 peak
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
CDF_df_pos<-rbind(CDF_df_16bp, CDF_df_20bp, CDF_df_30bp, CDF_df_40bp, CDF_df_50bp)
# Reorder levels of the factor variable based on descending fraction
CDF_df_pos <- CDF_df_pos %>%
arrange(dis, desc(prox_enrich_ave_difference)) %>%
group_by(dis) %>%
mutate(triplet = factor(triplet, levels = triplet))
head(CDF_df_pos)
## # A tibble: 6 × 6
## # Groups: dis [1]
## triplet prox_enrich_pos prox_enrich_neg_full prox_enrich_neg_indep
## <fct> <dbl> <dbl> <dbl>
## 1 ATA 0.760 0.327 0.315
## 2 TAT 0.760 0.327 0.315
## 3 TAA 0.801 0.424 0.411
## 4 TTA 0.800 0.424 0.411
## 5 ATC 0.806 0.437 0.434
## 6 GAT 0.806 0.437 0.434
## # ℹ 2 more variables: prox_enrich_ave_difference <dbl>, dis <chr>
CDF_df_pos <- CDF_df_pos %>%
mutate(my_color = ifelse(prox_enrich_ave_difference < 0, "grey",
ifelse(prox_enrich_ave_difference > 0.2, "lightblue", "lightgreen")))
head(CDF_df_pos)
## # A tibble: 6 × 7
## # Groups: dis [1]
## triplet prox_enrich_pos prox_enrich_neg_full prox_enrich_neg_indep
## <fct> <dbl> <dbl> <dbl>
## 1 ATA 0.760 0.327 0.315
## 2 TAT 0.760 0.327 0.315
## 3 TAA 0.801 0.424 0.411
## 4 TTA 0.800 0.424 0.411
## 5 ATC 0.806 0.437 0.434
## 6 GAT 0.806 0.437 0.434
## # ℹ 3 more variables: prox_enrich_ave_difference <dbl>, dis <chr>,
## # my_color <chr>
pdf('3mer_fraction_of_Proximal_Emrichment_to_GATA3_peaks.pdf',width=8,height=8)
my.settings <- list(
superpose.polygon=list(col=c("black", "grey"), border="transparent"),
strip.background=list(col="grey80", cex = 0.6),
strip.border=list(col="black")
)
print(
barchart(prox_enrich_ave_difference ~ triplet |dis,
horizontal = FALSE,
data=CDF_df_pos,
#type="count",
layout = c(1, 5),
#ylim=c(-0,0.5),
#col=ifelse(CDF_df_pos$prox_enrich_ave_difference >= 0, "skyblue", "grey"),
#col = ifelse(CDF_df_pos$prox_enrich_ave_difference >= 0.20, "skyblue",
# ifelse(CDF_df_pos$prox_enrich_ave_difference >= 0 & CDF_df_pos$prox_enrich_ave_difference < 0.20, "lightgreen", "grey")),
col=CDF_df_pos$my_color,
origin = 0,
scales = list(x = list(cex = 0.5, rot = 45)),
xlab="3mer",
ylab="fraction of proximal emrichment",
main="3mer Proximal Emrichment to GATA3 peaks",
cex.axis=0.5,
between=list(y=0.5, x=0.5),
font.axis=1,
par.settings=my.settings,
panel=function(...) {
panel.barchart(...)
panel.abline(h=0.20, lty = "dotted", col = "red", identifier = "abline")
panel.abline(h=0, lty = "dotted", col = "red", identifier = "abline")
}
)
)
dev.off()
## quartz_off_screen
## 2
It seems that the most robust categorization should be made at fraction differences==0. This will give us two category, “no enrichment” and “with enrichment”. There are 32 3mers enriched in close proximity to GATA3 peaks (remains true when defining “close proximity” with distance ==16bp, 20bp, 30bp, 50bp).
violin plot
my.settings <- list(
superpose.polygon=list(col=c("black", "grey"), border="transparent"),
strip.background=list(col="grey80", cex = 0.6),
strip.border=list(col="black")
)
bwplot(prox_enrich_ave_difference ~ dis,
data=CDF_df_pos,
#do.out = F,
xlab='distance', ylab='fraction of proximal emrichment',
ylim=c(-0.5, 1),
par.settings = my.settings,
scales=list(x=list(rot=45, cex=1)),
panel = function(..., box.ratio) {
panel.violin(...,col="white", varwidth = FALSE, box.ratio = box.ratio, do.out = F)
panel.abline(h= 0, lty =2, col="red")
panel.abline(h= 0.2, lty =2, col="red")
#panel.bwplot(...)
panel.stripplot(...,jitter.data = T, pch=4, cex=0.4, col="grey")
}
)
It seems reasonable to put a cut-off at 0.2 and a cut-off at 0 to divide 3mers into three categories: enriched, neutral and depleted.
3mer in depleted category remains same while using different distance from summits
CDF_df_16bp[CDF_df_16bp$prox_enrich_ave_difference < 0, ]$triplet
## [1] "CCA" "GCA" "CGA" "GGA" "CCT" "GCT" "CGT" "GGT" "CAC" "GAC" "CTC" "GTC"
## [13] "ACC" "TCC" "CCC" "GCC" "AGC" "TGC" "CGC" "GGC" "CAG" "GAG" "CTG" "GTG"
## [25] "ACG" "TCG" "CCG" "GCG" "AGG" "TGG" "CGG" "GGG"
CDF_df_20bp[CDF_df_20bp$prox_enrich_ave_difference < 0, ]$triplet
## [1] "CCA" "GCA" "CGA" "GGA" "CCT" "GCT" "CGT" "GGT" "CAC" "GAC" "CTC" "GTC"
## [13] "ACC" "TCC" "CCC" "GCC" "AGC" "TGC" "CGC" "GGC" "CAG" "GAG" "CTG" "GTG"
## [25] "ACG" "TCG" "CCG" "GCG" "AGG" "TGG" "CGG" "GGG"
CDF_df_30bp[CDF_df_30bp$prox_enrich_ave_difference < 0, ]$triplet
## [1] "CCA" "GCA" "CGA" "GGA" "CCT" "GCT" "CGT" "GGT" "CAC" "GAC" "CTC" "GTC"
## [13] "ACC" "TCC" "CCC" "GCC" "AGC" "TGC" "CGC" "GGC" "CAG" "GAG" "CTG" "GTG"
## [25] "ACG" "TCG" "CCG" "GCG" "AGG" "TGG" "CGG" "GGG"
CDF_df_40bp[CDF_df_40bp$prox_enrich_ave_difference < 0, ]$triplet
## [1] "CCA" "GCA" "CGA" "GGA" "CCT" "GCT" "CGT" "GGT" "CAC" "GAC" "CTC" "GTC"
## [13] "ACC" "TCC" "CCC" "GCC" "AGC" "TGC" "CGC" "GGC" "CAG" "GAG" "CTG" "GTG"
## [25] "ACG" "TCG" "CCG" "GCG" "AGG" "TGG" "CGG" "GGG"
CDF_df_50bp[CDF_df_50bp$prox_enrich_ave_difference < 0, ]$triplet
## [1] "CCA" "GCA" "CGA" "GGA" "CCT" "GCT" "CGT" "GGT" "CAC" "GAC" "CTC" "GTC"
## [13] "ACC" "TCC" "CCC" "GCC" "AGC" "TGC" "CGC" "GGC" "CAG" "GAG" "CTG" "GTG"
## [25] "ACG" "TCG" "CCG" "GCG" "AGG" "TGG" "CGG" "GGG"
3mer in enriched category changed while using different distance from summits
CDF_df_16bp[CDF_df_16bp$prox_enrich_ave_difference > 0.2, ]$triplet # --10 3mers
## [1] "AAA" "TAA" "ATA" "TTA" "AAT" "TAT" "GAT" "ATT" "TTT" "ATC"
CDF_df_20bp[CDF_df_20bp$prox_enrich_ave_difference > 0.2, ]$triplet # --10 3mers
## [1] "AAA" "TAA" "ATA" "TTA" "AAT" "TAT" "GAT" "ATT" "TTT" "ATC"
CDF_df_30bp[CDF_df_30bp$prox_enrich_ave_difference > 0.2, ]$triplet # --8 3mers (TTT/AAA now in the neutral group)
## [1] "TAA" "ATA" "TTA" "AAT" "TAT" "GAT" "ATT" "ATC"
CDF_df_40bp[CDF_df_40bp$prox_enrich_ave_difference > 0.2, ]$triplet # --8 3mers (TTT/AAA now in the neutral group)
## [1] "TAA" "ATA" "TTA" "AAT" "TAT" "GAT" "ATT" "ATC"
CDF_df_50bp[CDF_df_50bp$prox_enrich_ave_difference > 0.2, ]$triplet # --4 3mers (TTT/AAA, AAT/ATT, GAT/ATC now in the neutral group)
## [1] "TAA" "ATA" "TTA" "TAT"
It is unsurprising that, with an increase in the distance to 50bp, the enriched group exhibits fewer 3mers. As we expand the search radius for the closest 3mer (any 3mer), the likelihood of encountering random 3mers rises. This increase in random occurrences diminishes the observed ‘enrichment’ fraction.
3mer in neutral category changed while using different distance from summits
CDF_df_16bp[CDF_df_16bp$prox_enrich_ave_difference < 0.2 & CDF_df_50bp$prox_enrich_ave_difference > 0, ]$triplet # 22 3mers
## [1] "CAA" "GAA" "CTA" "GTA" "ACA" "TCA" "AGA" "TGA" "CAT" "CTT" "GTT" "ACT"
## [13] "TCT" "AGT" "TGT" "AAC" "TAC" "TTC" "AAG" "TAG" "ATG" "TTG"
CDF_df_20bp[CDF_df_20bp$prox_enrich_ave_difference < 0.2 & CDF_df_50bp$prox_enrich_ave_difference > 0, ]$triplet
## [1] "CAA" "GAA" "CTA" "GTA" "ACA" "TCA" "AGA" "TGA" "CAT" "CTT" "GTT" "ACT"
## [13] "TCT" "AGT" "TGT" "AAC" "TAC" "TTC" "AAG" "TAG" "ATG" "TTG"
CDF_df_30bp[CDF_df_30bp$prox_enrich_ave_difference < 0.2 & CDF_df_50bp$prox_enrich_ave_difference > 0, ]$triplet
## [1] "AAA" "CAA" "GAA" "CTA" "GTA" "ACA" "TCA" "AGA" "TGA" "CAT" "TTT" "CTT"
## [13] "GTT" "ACT" "TCT" "AGT" "TGT" "AAC" "TAC" "TTC" "AAG" "TAG" "ATG" "TTG"
CDF_df_40bp[CDF_df_40bp$prox_enrich_ave_difference < 0.2 & CDF_df_50bp$prox_enrich_ave_difference > 0, ]$triplet
## [1] "AAA" "CAA" "GAA" "CTA" "GTA" "ACA" "TCA" "AGA" "TGA" "CAT" "TTT" "CTT"
## [13] "GTT" "ACT" "TCT" "AGT" "TGT" "AAC" "TAC" "TTC" "AAG" "TAG" "ATG" "TTG"
CDF_df_50bp[CDF_df_50bp$prox_enrich_ave_difference < 0.2 & CDF_df_50bp$prox_enrich_ave_difference > 0, ]$triplet
## [1] "AAA" "CAA" "GAA" "CTA" "GTA" "ACA" "TCA" "AGA" "TGA" "AAT" "CAT" "GAT"
## [13] "ATT" "TTT" "CTT" "GTT" "ACT" "TCT" "AGT" "TGT" "AAC" "TAC" "ATC" "TTC"
## [25] "AAG" "TAG" "ATG" "TTG"
Figure 12: 3mer categorization
all traces in one plot
Based on the above analysis, we can cluster 3mer into () categories, namely “high enrichment”, “low enrichment”, and “no enrichment”. And plot these traces in one single CDF plot.
#module load R/4.1.2
#R
directory_path <- "/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/other_3mer"
output_file <- "/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/other_3mer/all_1st_3mer_distance.csv"
output_connection <- file(output_file, "w") # Open the output file for writing
folder_names <- apply(expand.grid(c("A", "T", "C", "G"), c("A", "T", "C", "G"), c("A", "T", "C", "G")), 1, paste, collapse = "")
#folder_names <- list.dirs(directory_path, full.names = FALSE) # Get a list of folder names in the directory
#define a function to calculate AUC
library(lattice)
library(data.table)
# Iterate through each folder
for (folder_name in folder_names) {
# Set the path to the current folder
current_folder <- file.path(directory_path, folder_name)
# Initialize an empty data.table to store the combined data for the current folder
combined_data <- data.table()
# Read the three .bed files
bed_files <- c(
paste0("closest.1st", folder_name, ".to.indep.DHS.control.consensus.bed"),
paste0("closest.1st", folder_name, ".to.full.DHS.control.consensus.bed"),
paste0("closest.1st", folder_name, ".to.pos.control.motif.all.bed")
)
for (bed_file in bed_files) {
# Read the .bed file into a data.table (adjust column indices accordingly)
current_data <- fread(file.path(current_folder, bed_file), header = FALSE)
# Extract the needed column's data
status <- gsub("closest\\.1st.*to\\.(.*?)\\.bed", "\\1", bed_file)
extracted_data <- current_data[, .(Column1 = V10, Column2 = V11, status)]
# Append the extracted data to the combined_data data.table
combined_data <- rbind(combined_data, extracted_data)
}
# Perform any additional operations on combined_data if needed
# Save the combined_data to the output file
write.table(combined_data, file = output_file, append = TRUE, col.names = FALSE, row.names = FALSE, sep = "\t")
}
# Close the output file
close(output_connection)
CDF plot with all traces (64 pos+64 neg)
library(lattice)
library(latticeExtra)
all_1st_3mer_distance=read.table("all_1st_3mer_distance.csv", header=FALSE)
colnames(all_1st_3mer_distance)=c("triplet", "dis", "status")
all_1st_3mer_distance$supp=paste0(all_1st_3mer_distance$triplet, "_", all_1st_3mer_distance$status)
df.plot=rbind(all_1st_3mer_distance[all_1st_3mer_distance$status=="indep.DHS.control.consensus",], all_1st_3mer_distance[all_1st_3mer_distance$status=="pos.control.motif.all",])
df.plot$supp = factor(df.plot$supp, levels = c("ATA_pos.control.motif.all", "TAT_pos.control.motif.all", "TAA_pos.control.motif.all", "TTA_pos.control.motif.all", "ATC_pos.control.motif.all",
"GAT_pos.control.motif.all", "AAT_pos.control.motif.all", "ATT_pos.control.motif.all", "TTT_pos.control.motif.all", "AAA_pos.control.motif.all",
"AGA_pos.control.motif.all", "TCT_pos.control.motif.all", "TAG_pos.control.motif.all", "CTA_pos.control.motif.all", "ATG_pos.control.motif.all",
"CAT_pos.control.motif.all", "TGA_pos.control.motif.all", "TCA_pos.control.motif.all", "GTA_pos.control.motif.all", "TAC_pos.control.motif.all",
"CTT_pos.control.motif.all", "AAG_pos.control.motif.all", "CAA_pos.control.motif.all", "TTG_pos.control.motif.all", "AAC_pos.control.motif.all",
"GTT_pos.control.motif.all", "ACA_pos.control.motif.all", "TGT_pos.control.motif.all", "GAA_pos.control.motif.all", "TTC_pos.control.motif.all",
"ACT_pos.control.motif.all", "AGT_pos.control.motif.all", "CTG_pos.control.motif.all", "CAG_pos.control.motif.all", "CGA_pos.control.motif.all",
"TCG_pos.control.motif.all", "CGT_pos.control.motif.all", "ACG_pos.control.motif.all", "CTC_pos.control.motif.all", "GAG_pos.control.motif.all",
"GTC_pos.control.motif.all", "GAC_pos.control.motif.all", "GGT_pos.control.motif.all", "ACC_pos.control.motif.all", "GCA_pos.control.motif.all",
"TGC_pos.control.motif.all", "CAC_pos.control.motif.all", "GTG_pos.control.motif.all", "TCC_pos.control.motif.all", "GGA_pos.control.motif.all",
"TGG_pos.control.motif.all", "CCA_pos.control.motif.all", "AGC_pos.control.motif.all", "GCT_pos.control.motif.all", "CCT_pos.control.motif.all",
"AGG_pos.control.motif.all", "CCG_pos.control.motif.all", "CGG_pos.control.motif.all", "CGC_pos.control.motif.all", "GCG_pos.control.motif.all",
"CCC_pos.control.motif.all", "GGG_pos.control.motif.all", "GCC_pos.control.motif.all", "GGC_pos.control.motif.all","ATA_indep.DHS.control.consensus", "TAT_indep.DHS.control.consensus", "TAA_indep.DHS.control.consensus", "TTA_indep.DHS.control.consensus", "ATC_indep.DHS.control.consensus",
"GAT_indep.DHS.control.consensus", "AAT_indep.DHS.control.consensus", "ATT_indep.DHS.control.consensus", "TTT_indep.DHS.control.consensus", "AAA_indep.DHS.control.consensus","AGA_indep.DHS.control.consensus", "TCT_indep.DHS.control.consensus", "TAG_indep.DHS.control.consensus", "CTA_indep.DHS.control.consensus", "ATG_indep.DHS.control.consensus","CAT_indep.DHS.control.consensus", "TGA_indep.DHS.control.consensus", "TCA_indep.DHS.control.consensus", "GTA_indep.DHS.control.consensus", "TAC_indep.DHS.control.consensus", "CTT_indep.DHS.control.consensus", "AAG_indep.DHS.control.consensus", "CAA_indep.DHS.control.consensus", "TTG_indep.DHS.control.consensus", "AAC_indep.DHS.control.consensus", "GTT_indep.DHS.control.consensus", "ACA_indep.DHS.control.consensus", "TGT_indep.DHS.control.consensus", "GAA_indep.DHS.control.consensus", "TTC_indep.DHS.control.consensus", "ACT_indep.DHS.control.consensus", "AGT_indep.DHS.control.consensus", "CTG_indep.DHS.control.consensus", "CAG_indep.DHS.control.consensus", "CGA_indep.DHS.control.consensus", "TCG_indep.DHS.control.consensus", "CGT_indep.DHS.control.consensus", "ACG_indep.DHS.control.consensus", "CTC_indep.DHS.control.consensus", "GAG_indep.DHS.control.consensus", "GTC_indep.DHS.control.consensus", "GAC_indep.DHS.control.consensus", "GGT_indep.DHS.control.consensus", "ACC_indep.DHS.control.consensus", "GCA_indep.DHS.control.consensus", "TGC_indep.DHS.control.consensus", "CAC_indep.DHS.control.consensus", "GTG_indep.DHS.control.consensus", "TCC_indep.DHS.control.consensus", "GGA_indep.DHS.control.consensus", "TGG_indep.DHS.control.consensus", "CCA_indep.DHS.control.consensus", "AGC_indep.DHS.control.consensus", "GCT_indep.DHS.control.consensus", "CCT_indep.DHS.control.consensus", "AGG_indep.DHS.control.consensus", "CCG_indep.DHS.control.consensus", "CGG_indep.DHS.control.consensus", "CGC_indep.DHS.control.consensus", "GCG_indep.DHS.control.consensus", "CCC_indep.DHS.control.consensus", "GGG_indep.DHS.control.consensus", "GCC_indep.DHS.control.consensus", "GGC_indep.DHS.control.consensus"))
pdf('3mer_Distance_from_peak_summit_all_traces.pdf',width=8,height=8)
print(ecdfplot(~log(abs(dis), base = 10), groups = supp, data = df.plot,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c(colorRampPalette(c("red","pink"))(32),colorRampPalette(c("blue","lightblue"))(32), colorRampPalette(c("grey22","lightgrey"))(64)),
aspect = 1,
#xlim = c(0, 50000),
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = expression('log'[10]~'3mer Distance from peak summit'),
#index.cond = list(c(2,1)),
between=list(y=1.0),
type = 'a',
xlim = c(0,2.5),
lwd=2,
lty=c(1),
par.settings = list(superpose.line = list(col=c(colorRampPalette(c("red","pink"))(32),colorRampPalette(c("blue","lightblue"))(32), colorRampPalette(c("grey22","lightgrey"))(64)), lwd=3), strip.background=list(col="grey85"))
#panel = function(...) {
#panel.abline(v= 1, lty =2, col="red")
#panel.abline(v= 18, lty =2, col="grey30")
#panel.ecdfplot(...)
#}
)
)
dev.off()
This CDF looks too crowded. I will subset to “No enrichment” and “with enrichment” 3mers, then make two CDF plots.
library(lattice)
library(latticeExtra)
all_1st_3mer_distance=read.table("all_1st_3mer_distance.csv", header=FALSE)
colnames(all_1st_3mer_distance)=c("triplet", "dis", "status")
all_1st_3mer_distance$supp=paste0(all_1st_3mer_distance$triplet, "_", all_1st_3mer_distance$status)
df.plot=rbind(all_1st_3mer_distance[all_1st_3mer_distance$status=="indep.DHS.control.consensus",], all_1st_3mer_distance[all_1st_3mer_distance$status=="pos.control.motif.all",])
with_enrichment_list <- c("ATA", "TAT", "TAA", "TTA", "ATC", "GAT", "AAT", "ATT", "TTT", "AAA", "AGA", "TCT", "TAG", "CTA", "ATG", "CAT", "TGA", "TCA", "GTA", "TAC", "CTT", "AAG", "CAA", "TTG", "AAC", "GTT", "ACA", "TGT", "GAA", "TTC", "ACT","AGT")
no_enrichment_list <- c( "CTG", "CAG", "CGA", "TCG", "CGT", "ACG", "CTC", "GAG", "GTC", "GAC", "GGT", "ACC", "GCA", "TGC", "CAC", "GTG", "TCC", "GGA", "TGG", "CCA", "AGC", "GCT", "CCT","AGG", "CCG", "CGG", "CGC", "GCG", "CCC", "GGG", "GCC", "GGC")
df_plot_with_enr=df.plot[df.plot$triplet %in% with_enrichment_list, ]
df_plot_no_enr=df.plot[df.plot$triplet %in% no_enrichment_list, ]
df_plot_with_enr$supp = factor(df_plot_with_enr$supp, levels = c("ATA_pos.control.motif.all", "TAT_pos.control.motif.all", "TAA_pos.control.motif.all", "TTA_pos.control.motif.all", "ATC_pos.control.motif.all",
"GAT_pos.control.motif.all", "AAT_pos.control.motif.all", "ATT_pos.control.motif.all", "TTT_pos.control.motif.all", "AAA_pos.control.motif.all",
"AGA_pos.control.motif.all", "TCT_pos.control.motif.all", "TAG_pos.control.motif.all", "CTA_pos.control.motif.all", "ATG_pos.control.motif.all",
"CAT_pos.control.motif.all", "TGA_pos.control.motif.all", "TCA_pos.control.motif.all", "GTA_pos.control.motif.all", "TAC_pos.control.motif.all",
"CTT_pos.control.motif.all", "AAG_pos.control.motif.all", "CAA_pos.control.motif.all", "TTG_pos.control.motif.all", "AAC_pos.control.motif.all",
"GTT_pos.control.motif.all", "ACA_pos.control.motif.all", "TGT_pos.control.motif.all", "GAA_pos.control.motif.all", "TTC_pos.control.motif.all",
"ACT_pos.control.motif.all", "AGT_pos.control.motif.all",
"ATA_indep.DHS.control.consensus", "TAT_indep.DHS.control.consensus", "TAA_indep.DHS.control.consensus", "TTA_indep.DHS.control.consensus", "ATC_indep.DHS.control.consensus", "GAT_indep.DHS.control.consensus", "AAT_indep.DHS.control.consensus", "ATT_indep.DHS.control.consensus", "TTT_indep.DHS.control.consensus", "AAA_indep.DHS.control.consensus","AGA_indep.DHS.control.consensus", "TCT_indep.DHS.control.consensus", "TAG_indep.DHS.control.consensus", "CTA_indep.DHS.control.consensus", "ATG_indep.DHS.control.consensus","CAT_indep.DHS.control.consensus", "TGA_indep.DHS.control.consensus", "TCA_indep.DHS.control.consensus", "GTA_indep.DHS.control.consensus", "TAC_indep.DHS.control.consensus", "CTT_indep.DHS.control.consensus", "AAG_indep.DHS.control.consensus", "CAA_indep.DHS.control.consensus", "TTG_indep.DHS.control.consensus", "AAC_indep.DHS.control.consensus", "GTT_indep.DHS.control.consensus", "ACA_indep.DHS.control.consensus", "TGT_indep.DHS.control.consensus", "GAA_indep.DHS.control.consensus", "TTC_indep.DHS.control.consensus", "ACT_indep.DHS.control.consensus","AGT_indep.DHS.control.consensus"))
df_plot_no_enr$supp = factor(df_plot_no_enr$supp, levels = c("CTG_pos.control.motif.all", "CAG_pos.control.motif.all", "CGA_pos.control.motif.all",
"TCG_pos.control.motif.all", "CGT_pos.control.motif.all", "ACG_pos.control.motif.all", "CTC_pos.control.motif.all", "GAG_pos.control.motif.all",
"GTC_pos.control.motif.all", "GAC_pos.control.motif.all", "GGT_pos.control.motif.all", "ACC_pos.control.motif.all", "GCA_pos.control.motif.all",
"TGC_pos.control.motif.all", "CAC_pos.control.motif.all", "GTG_pos.control.motif.all", "TCC_pos.control.motif.all", "GGA_pos.control.motif.all",
"TGG_pos.control.motif.all", "CCA_pos.control.motif.all", "AGC_pos.control.motif.all", "GCT_pos.control.motif.all", "CCT_pos.control.motif.all",
"AGG_pos.control.motif.all", "CCG_pos.control.motif.all", "CGG_pos.control.motif.all", "CGC_pos.control.motif.all", "GCG_pos.control.motif.all",
"CCC_pos.control.motif.all", "GGG_pos.control.motif.all", "GCC_pos.control.motif.all", "GGC_pos.control.motif.all", "CTG_indep.DHS.control.consensus", "CAG_indep.DHS.control.consensus", "CGA_indep.DHS.control.consensus", "TCG_indep.DHS.control.consensus", "CGT_indep.DHS.control.consensus", "ACG_indep.DHS.control.consensus", "CTC_indep.DHS.control.consensus", "GAG_indep.DHS.control.consensus", "GTC_indep.DHS.control.consensus", "GAC_indep.DHS.control.consensus", "GGT_indep.DHS.control.consensus", "ACC_indep.DHS.control.consensus", "GCA_indep.DHS.control.consensus", "TGC_indep.DHS.control.consensus", "CAC_indep.DHS.control.consensus", "GTG_indep.DHS.control.consensus", "TCC_indep.DHS.control.consensus", "GGA_indep.DHS.control.consensus", "TGG_indep.DHS.control.consensus", "CCA_indep.DHS.control.consensus", "AGC_indep.DHS.control.consensus", "GCT_indep.DHS.control.consensus", "CCT_indep.DHS.control.consensus", "AGG_indep.DHS.control.consensus", "CCG_indep.DHS.control.consensus", "CGG_indep.DHS.control.consensus", "CGC_indep.DHS.control.consensus", "GCG_indep.DHS.control.consensus", "CCC_indep.DHS.control.consensus", "GGG_indep.DHS.control.consensus", "GCC_indep.DHS.control.consensus", "GGC_indep.DHS.control.consensus"))
pdf('3mer_Distance_from_peak_summit_with_enr_traces.pdf',width=8,height=15)
print(ecdfplot(~log(abs(dis), base = 10), groups = supp, data = df_plot_with_enr,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c(colorRampPalette(c("red","pink"))(32), colorRampPalette(c("grey22","lightgrey"))(32)),
aspect = 1,
#xlim = c(0, 50000),
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = expression('log'[10]~'3mer Distance from peak summit'),
#index.cond = list(c(2,1)),
between=list(y=1.0),
type = 'a',
xlim = c(0,2.5),
lwd=2,
lty=c(1),
par.settings = list(superpose.line = list(col=c(colorRampPalette(c("red","pink"))(32), colorRampPalette(c("grey22","lightgrey"))(32)), lwd=3), strip.background=list(col="grey85"))
#panel = function(...) {
#panel.abline(v= 1, lty =2, col="red")
#panel.abline(v= 18, lty =2, col="grey30")
#panel.ecdfplot(...)
#}
)
)
dev.off()
pdf('3mer_Distance_from_peak_summit_no_enr_traces.pdf',width=8,height=15)
print(ecdfplot(~log(abs(dis), base = 10), groups = supp, data = df_plot_no_enr,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c(colorRampPalette(c("blue","lightblue"))(32), colorRampPalette(c("grey22","lightgrey"))(32)),
aspect = 1,
#xlim = c(0, 50000),
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = expression('log'[10]~'3mer Distance from peak summit'),
#index.cond = list(c(2,1)),
between=list(y=1.0),
type = 'a',
xlim = c(0,2.5),
lwd=2,
lty=c(1),
par.settings = list(superpose.line = list(col=c(colorRampPalette(c("blue","lightblue"))(32), colorRampPalette(c("grey22","lightgrey"))(32)), lwd=3), strip.background=list(col="grey85"))
#panel = function(...) {
#panel.abline(v= 1, lty =2, col="red")
#panel.abline(v= 18, lty =2, col="grey30")
#panel.ecdfplot(...)
#}
)
)
dev.off()
We could further separate the 3mers into “enriched” and “neutral” groups.
And we could make 5 CDFs for the five different distance from summits to visualize 3mer change within each category.
library(lattice)
library(latticeExtra)
all_1st_3mer_distance=read.table("all_1st_3mer_distance.csv", header=FALSE)
colnames(all_1st_3mer_distance)=c("triplet", "dis", "status")
all_1st_3mer_distance$supp=paste0(all_1st_3mer_distance$triplet, "_", all_1st_3mer_distance$status)
df.plot=rbind(all_1st_3mer_distance[all_1st_3mer_distance$status=="indep.DHS.control.consensus",], all_1st_3mer_distance[all_1st_3mer_distance$status=="pos.control.motif.all",])
with_enrichment_list <- c("ATA", "TAT", "TAA", "TTA", "ATC", "GAT", "AAT", "ATT", "TTT", "AAA", "AGA", "TCT", "TAG", "CTA", "ATG", "CAT", "TGA", "TCA", "GTA", "TAC", "CTT", "AAG", "CAA", "TTG", "AAC", "GTT", "ACA", "TGT", "GAA", "TTC", "ACT","AGT")
df_plot_with_enr=df.plot[df.plot$triplet %in% with_enrichment_list, ]
df_plot_with_enr$supp = factor(df_plot_with_enr$supp, levels = c( "ATA_indep.DHS.control.consensus", "TAT_indep.DHS.control.consensus", "TAA_indep.DHS.control.consensus", "TTA_indep.DHS.control.consensus", "ATC_indep.DHS.control.consensus", "GAT_indep.DHS.control.consensus", "AAT_indep.DHS.control.consensus", "ATT_indep.DHS.control.consensus", "TTT_indep.DHS.control.consensus", "AAA_indep.DHS.control.consensus","AGA_indep.DHS.control.consensus", "TCT_indep.DHS.control.consensus", "TAG_indep.DHS.control.consensus", "CTA_indep.DHS.control.consensus", "ATG_indep.DHS.control.consensus","CAT_indep.DHS.control.consensus", "TGA_indep.DHS.control.consensus", "TCA_indep.DHS.control.consensus", "GTA_indep.DHS.control.consensus", "TAC_indep.DHS.control.consensus", "CTT_indep.DHS.control.consensus", "AAG_indep.DHS.control.consensus", "CAA_indep.DHS.control.consensus", "TTG_indep.DHS.control.consensus", "AAC_indep.DHS.control.consensus", "GTT_indep.DHS.control.consensus", "ACA_indep.DHS.control.consensus", "TGT_indep.DHS.control.consensus", "GAA_indep.DHS.control.consensus", "TTC_indep.DHS.control.consensus", "ACT_indep.DHS.control.consensus","AGT_indep.DHS.control.consensus", "ATA_pos.control.motif.all", "TAT_pos.control.motif.all", "TAA_pos.control.motif.all", "TTA_pos.control.motif.all", "ATC_pos.control.motif.all", "GAT_pos.control.motif.all", "AAT_pos.control.motif.all", "ATT_pos.control.motif.all", "TTT_pos.control.motif.all", "AAA_pos.control.motif.all", "AGA_pos.control.motif.all", "TCT_pos.control.motif.all", "TAG_pos.control.motif.all", "CTA_pos.control.motif.all", "ATG_pos.control.motif.all", "CAT_pos.control.motif.all", "TGA_pos.control.motif.all", "TCA_pos.control.motif.all", "GTA_pos.control.motif.all", "TAC_pos.control.motif.all", "CTT_pos.control.motif.all", "AAG_pos.control.motif.all", "CAA_pos.control.motif.all", "TTG_pos.control.motif.all", "AAC_pos.control.motif.all", "GTT_pos.control.motif.all", "ACA_pos.control.motif.all", "TGT_pos.control.motif.all", "GAA_pos.control.motif.all", "TTC_pos.control.motif.all", "ACT_pos.control.motif.all", "AGT_pos.control.motif.all"))
pdf('3mer_Distance_from_peak_summit_with_enr_traces_in_group.pdf',width=8,height=15)
print(ecdfplot(~log(abs(dis), base = 10), groups = supp, data = df_plot_with_enr,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c(colorRampPalette(c("grey22","lightgrey"))(32), "red","red","red","red", "hotpink","hotpink","hotpink","hotpink","pink","pink", colorRampPalette(c("green","lightgreen"))(22)),
aspect = 1,
#xlim = c(0, 50000),
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = expression('log'[10]~'3mer Distance from peak summit'),
#index.cond = list(c(2,1)),
between=list(y=1.0),
type = 'a',
xlim = c(0,2.5),
lwd=2,
lty=c(1),
par.settings = list(superpose.line = list(col=c(colorRampPalette(c("grey22","lightgrey"))(32), "red","red","red","red", "hotpink","hotpink","hotpink","hotpink","pink","pink", colorRampPalette(c("green","lightgreen"))(22)), lwd=3), strip.background=list(col="grey85"))
#panel = function(...) {
#panel.abline(v= 1, lty =2, col="red")
#panel.abline(v= 18, lty =2, col="grey30")
#panel.ecdfplot(...)
#}
)
)
dev.off()
pdf('3mer_Distance_from_peak_summit_no_enr_traces.pdf',width=8,height=15)
print(ecdfplot(~log(abs(dis), base = 10), groups = supp, data = df_plot_no_enr,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c(colorRampPalette(c("blue","lightblue"))(32), colorRampPalette(c("grey22","lightgrey"))(32)),
aspect = 1,
#xlim = c(0, 50000),
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = expression('log'[10]~'3mer Distance from peak summit'),
#index.cond = list(c(2,1)),
between=list(y=1.0),
type = 'a',
xlim = c(0,2.5),
lwd=2,
lty=c(1),
par.settings = list(superpose.line = list(col=c(colorRampPalette(c("blue","lightblue"))(32), colorRampPalette(c("grey22","lightgrey"))(32)), lwd=3), strip.background=list(col="grey85"))
#panel = function(...) {
#panel.abline(v= 1, lty =2, col="red")
#panel.abline(v= 18, lty =2, col="grey30")
#panel.ecdfplot(...)
#}
)
)
dev.off()
There are 32 3mers ( with GAT/ATC, ATA/TAT, TAA/TTA on the top of the list) exhibit closer proximity to the GATA3 peak summit and demonstrate enrichment compared to DHS regions near (use distance==16bp) the GATA3 peak summit.
enriched (n=10): “AAA” “TAA” “ATA” “TTA” “AAT” “TAT” “GAT” “ATT” “TTT” “ATC”
neutral (n=22): “CAA” “GAA” “CTA” “GTA” “ACA” “TCA” “AGA” “TGA” “CAT” “CTT” “GTT” “ACT” “TCT” “AGT” “TGT” “AAC” “TAC” “TTC” “AAG” “TAG” “ATG” “TTG”
There are 32 3mers do not show closer proximity and have no enrichment near GATA3 peaks. These 3mers shows a negative CDF differences in the ranked list. These 3mers are: GGC/GCC, GGG/CCC, GCG/CGC, CGG/CCG, AGG/CCT, GCT/AGC, CCA/TGG, GGA/TCC, GTG/CAC, TGC/GCA, ACC/GGT, GAT/GTC, GAG/CTC, ACG/CGT, TCG/CGA, CAG/CTG. We can put these 3mers aside for now.
depleted: “CTG”, “CAG”, “CGA”, “TCG”, “CGT”, “ACG”, “CTC”, “GAG”, “GTC”, “GAC”, “GGT”, “ACC”, “GCA”, “TGC”, “CAC”, “GTG”, “TCC”, “GGA”, “TGG”, “CCA”, “AGC”, “GCT”, “CCT”,“AGG”, “CCG”, “CGG”, “CGC”, “GCG”, “CCC”, “GGG”, “GCC”, “GGC”
The different 3mer in enriched cluster is AAA/TTT while we change the 16/20bp to 30/40bp. I want to include AAA/TTT since they are reasonable 3mer while looking at the PWM of GATA3 motifs.
In this section, we need to consider two things:
One is to define spacing/distance as the relative position of two zinc finger;
the other is to separate the cases with different strand orientation.
Plan: Use the positive control GATA3 peaks with known motif structure:
First, identify the closest GAT to the peak summit on the plus and minus strands, respectively.
For the plus strand:
a. Anchor at the 1st +GAT and find the 2nd closest GAT on the same strand (+). This is equivalent to searching for the opposite strand ATC (coherence check1).
b. Anchor at the 1st +GAT and find the 2nd closest GAT on the opposite strand (-). This is equivalent to searching for the same strand ATC (coherence check2).
For the minus strand:
a. Anchor at the 1st -GAT and find the 2nd closest GAT on the same strand (-). This is equivalent to searching for the opposite strand ATC (coherence check3).
b. Anchor at the 1st -GAT and find the 2nd closest GAT on the opposite strand (+). This is equivalent to searching for the same strand ATC (coherence check4).
Coherence check5: Visualize on genome browser
Figure 13: GATA3 ChIP peak MME/STREME found motifs
In previous analysis (refer to section 2.7.3 positive control: 1 peak with 1 best motif), we have made specific BED files contain information of GATA3 peaks with its best motifs (five motifs with known structure).
dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/GAT_3mer_analaysis/MAST_positive_control/
wc -l ${dir}*GATA3_peak_161win_with_motif*.bed
12470 GATA3_peak_161win_with_motif_1.bed
11475 GATA3_peak_161win_with_motif_2.bed
6505 GATA3_peak_161win_with_motif_4.bed
4167 GATA3_peak_161win_with_motif_5.bed
5363 GATA3_peak_161win_with_motif_6.bed
39980 total
We will find the closest GAT on + strand or on - strand for peak with motif1, motif4, motif2, motif6, respectively.
bedTools.closest <- function(functionstring="/home/FCAM/ssun/packages/bedtools2/bin/closestBed",bed1,bed2,opt.string="") {
options(scipen =99) # not use scientific notation when writing out
#write bed formatted data.frames to tempfile
write.table(bed1,file= 'a.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(bed2,file= 'b.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
# create the command string and call the command using system()
# the command sort a and b file by coordinates
command1=paste('sort -k1,1 -k2,2n', 'a.file.bed', '> a.file.sorted.bed')
cat(command1,"\n") #sort -k1,1 -k2,2n a.file.bed > a.file.sorted.bed
try(system(command1))
command2=paste('sort -k1,1 -k2,2n', 'b.file.bed', '> b.file.sorted.bed')
cat(command2,"\n")
try(system(command2))
# the command call closestBed on bed1 and bed2
command=paste(functionstring, opt.string,"-a",'a.file.sorted.bed',"-b",'b.file.sorted.bed',">",'out.file.bed',sep=" ")
cat(command,"\n")
try(system(command))
res=read.table('out.file.bed',header=F, comment.char='')
# remove intermediate files
command3=paste('rm', 'a.file.bed', 'b.file.bed', 'a.file.sorted.bed', 'b.file.sorted.bed', 'out.file.bed')
cat(command3,"\n")
try(system(command3))
colnames(res) = c(colnames(bed1), colnames(bed2), 'dis' )
return(res)
}
dir1="/labs/Guertin/siyu/Sathyan_GATA3_ChIP_pool1_pool2/overrep_3mer/hg38_full_kmer3_rs1000/seqdump/"
plus.GAT.file=read.table(file = paste0(dir1,"hg38.3.3.3plus.36_GAT.bed"), sep="\t", header=FALSE)
minus.GAT.file=read.table(file = paste0(dir1,"hg38.3.3.3minus.14_GAT.bed"), sep="\t", header=FALSE)
head(plus.GAT.file)
nrow(plus.GAT.file)
tail(minus.GAT.file)
nrow(minus.GAT.file)
library(bigWig)
dir2="/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/GAT_3mer_analaysis/MAST_positive_control/"
# pos
GATA3_peak_summits_motif1=center.bed(read.table(paste0(dir2, "GATA3_peak_161win_with_motif_1.bed"), header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
GATA3_peak_summits_motif4=center.bed(read.table(paste0(dir2, "GATA3_peak_161win_with_motif_4.bed"), header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
GATA3_peak_summits_motif2=center.bed(read.table(paste0(dir2, "GATA3_peak_161win_with_motif_2.bed"), header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
GATA3_peak_summits_motif6=center.bed(read.table(paste0(dir2, "GATA3_peak_161win_with_motif_6.bed"), header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
GATA3_peak_summits_motif5=center.bed(read.table(paste0(dir2, "GATA3_peak_161win_with_motif_5.bed"), header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
# consensus neg
indep.DHS.control.consensus=center.bed(read.table("../ENCODE_DHS_GSE29692/MCF7DHS_consensus_noGATA_without_motifs_123456_78.bed", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
nrow(indep.DHS.control.consensus)
head(indep.DHS.control.consensus)
# closestBed--1st closest plus
closest.1st.plus.GAT.to.GATA3.with.motif1=bedTools.closest(bed1 = GATA3_peak_summits_motif1[,1:3], bed2 = plus.GAT.file, opt.string = '-d -t first')
closest.1st.plus.GAT.to.GATA3.with.motif4=bedTools.closest(bed1 = GATA3_peak_summits_motif4[,1:3], bed2 = plus.GAT.file, opt.string = '-d -t first')
closest.1st.plus.GAT.to.GATA3.with.motif2=bedTools.closest(bed1 = GATA3_peak_summits_motif2[,1:3], bed2 = plus.GAT.file, opt.string = '-d -t first')
closest.1st.plus.GAT.to.GATA3.with.motif6=bedTools.closest(bed1 = GATA3_peak_summits_motif6[,1:3], bed2 = plus.GAT.file, opt.string = '-d -t first')
closest.1st.plus.GAT.to.GATA3.with.motif5=bedTools.closest(bed1 = GATA3_peak_summits_motif5[,1:3], bed2 = plus.GAT.file, opt.string = '-d -t first')
write.table(closest.1st.plus.GAT.to.GATA3.with.motif1,file= 'closest.1st.plus.GAT.to.GATA3.with.motif1.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1st.plus.GAT.to.GATA3.with.motif4,file= 'closest.1st.plus.GAT.to.GATA3.with.motif4.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1st.plus.GAT.to.GATA3.with.motif2,file= 'closest.1st.plus.GAT.to.GATA3.with.motif2.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1st.plus.GAT.to.GATA3.with.motif6,file= 'closest.1st.plus.GAT.to.GATA3.with.motif6.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1st.plus.GAT.to.GATA3.with.motif5,file= 'closest.1st.plus.GAT.to.GATA3.with.motif5.bed', quote=F,sep="\t",col.names=F,row.names=F)
closest.1st.plus.GAT.to.indep.DHS.control.consensus=bedTools.closest(bed1 = indep.DHS.control.consensus[,1:3], bed2 = plus.GAT.file, opt.string = '-d -t first')
write.table(closest.1st.plus.GAT.to.indep.DHS.control.consensus,file= 'closest.1st.plus.GAT.to.indep.DHS.control.consensus.bed', quote=F,sep="\t",col.names=F,row.names=F)
# closestBed--1st closest minus
closest.1st.minus.GAT.to.GATA3.with.motif1=bedTools.closest(bed1 = GATA3_peak_summits_motif1[,1:3], bed2 = minus.GAT.file, opt.string = '-d -t first')
closest.1st.minus.GAT.to.GATA3.with.motif4=bedTools.closest(bed1 = GATA3_peak_summits_motif4[,1:3], bed2 = minus.GAT.file, opt.string = '-d -t first')
closest.1st.minus.GAT.to.GATA3.with.motif2=bedTools.closest(bed1 = GATA3_peak_summits_motif2[,1:3], bed2 = minus.GAT.file, opt.string = '-d -t first')
closest.1st.minus.GAT.to.GATA3.with.motif6=bedTools.closest(bed1 = GATA3_peak_summits_motif6[,1:3], bed2 = minus.GAT.file, opt.string = '-d -t first')
closest.1st.minus.GAT.to.GATA3.with.motif5=bedTools.closest(bed1 = GATA3_peak_summits_motif5[,1:3], bed2 = minus.GAT.file, opt.string = '-d -t first')
write.table(closest.1st.minus.GAT.to.GATA3.with.motif1,file= 'closest.1st.minus.GAT.to.GATA3.with.motif1.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1st.minus.GAT.to.GATA3.with.motif4,file= 'closest.1st.minus.GAT.to.GATA3.with.motif4.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1st.minus.GAT.to.GATA3.with.motif2,file= 'closest.1st.minus.GAT.to.GATA3.with.motif2.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1st.minus.GAT.to.GATA3.with.motif6,file= 'closest.1st.minus.GAT.to.GATA3.with.motif6.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1st.minus.GAT.to.GATA3.with.motif5,file= 'closest.1st.minus.GAT.to.GATA3.with.motif5.bed', quote=F,sep="\t",col.names=F,row.names=F)
closest.1st.minus.GAT.to.indep.DHS.control.consensus=bedTools.closest(bed1 = indep.DHS.control.consensus[,1:3], bed2 = minus.GAT.file, opt.string = '-d -t first')
write.table(closest.1st.minus.GAT.to.indep.DHS.control.consensus,file= 'closest.1st.minus.GAT.to.indep.DHS.control.consensus.bed', quote=F,sep="\t",col.names=F,row.names=F)
save.image('240125_GATA3_ChIP_clsestbed.Rdata')
Now we have the :
closest.1st.plus.GAT.to.GATA3.with.motif1.bed
closest.1st.minus.GAT.to.GATA3.with.motif1.bed
We want to extract the GAT coordinates info from the BED files, and remove the overlapped regions from the files that contains GAT coordinates info on the entire hg38 genome (hg38.3.3.3plus.36_GAT.bed/hg38.3.3.3minus.14_GAT.bed). Then locate the “second closest” + or - GAT to +/- “closest” GAT.
Then draw PDF to inspect the enrichment.
Remove the first closest GAT with bedtools subtract
Subtract the 1st closest GAT from all.GAT, then find the closest 2nd GAT to the closest 1st GAT.
-f Requiring a minimal overlap fraction before subtracting. Here we define -f 1.00 to make sure of a 100% overlap between two file before subtracting.
-s Enforcing same “strandedness” while scanning for features in -b file that should be subtracted from -a file.
module load bedtools
dir=/labs/Guertin/siyu/Sathyan_GATA3_ChIP_pool1_pool2/overrep_3mer/hg38_full_kmer3_rs1000/seqdump/
sort -k1,1 -k2,2n ${dir}hg38.3.3.3plus.36_GAT.bed > hg38.3.3.3plus.36_GAT.sorted.bed
awk '{print $4, $5, $6, $7, $8, $9, $10}' closest.1st.plus.GAT.to.GATA3.with.motif1.bed | awk '{$1=$1}1' OFS="\t" | uniq | sort -k1,1 -k2,2n > closest.1st.plus.GAT.to.GATA3.with.motif1.uniq.sorted.bed
bedtools subtract -a hg38.3.3.3plus.36_GAT.sorted.bed -b closest.1st.plus.GAT.to.GATA3.with.motif1.uniq.sorted.bed -f 1.00 -s > hg38.3.3.3plus.36_GAT_without_1st_plus_GAT_to_GATA3_with_motif1.bed
wc -l hg38.3.3.3plus.36_GAT.sorted.bed #39090736
wc -l closest.1st.plus.GAT.to.GATA3.with.motif1.uniq.sorted.bed #12461
wc -l hg38.3.3.3plus.36_GAT_without_1st_plus_GAT_to_GATA3_with_motif1.bed #39078275
head -5 hg38.3.3.3plus.36_GAT_without_1st_plus_GAT_to_GATA3_with_motif1.bed
#chr1 11159 11162 36 36 + GAT
#chr1 11462 11465 36 36 + GAT
#chr1 11521 11524 36 36 + GAT
#chr1 11569 11572 36 36 + GAT
#chr1 11651 11654 36 36 + GAT
sort -k1,1 -k2,2n ${dir}hg38.3.3.3minus.14_GAT.bed > hg38.3.3.3minus.14_GAT.sorted.bed
awk '{print $4, $5, $6, $7, $8, $9, $10}' closest.1st.minus.GAT.to.GATA3.with.motif1.bed | awk '{$1=$1}1' OFS="\t" | uniq | sort -k1,1 -k2,2n > closest.1st.minus.GAT.to.GATA3.with.motif1.uniq.sorted.bed
bedtools subtract -a hg38.3.3.3minus.14_GAT.sorted.bed -b closest.1st.minus.GAT.to.GATA3.with.motif1.uniq.sorted.bed -f 1.00 -s > hg38.3.3.3minus.14_GAT_without_1st_minus_GAT_to_GATA3_with_motif1.bed
wc -l hg38.3.3.3minus.14_GAT.sorted.bed #38717592
wc -l closest.1st.minus.GAT.to.GATA3.with.motif1.uniq.sorted.bed #12460
wc -l hg38.3.3.3minus.14_GAT_without_1st_minus_GAT_to_GATA3_with_motif1.bed #38705132
#38717592-12460
#[1] 38705132
head -5 hg38.3.3.3minus.14_GAT_without_1st_minus_GAT_to_GATA3_with_motif1.bed
#chr1 10545 10548 14 14 - GAT
#chr1 11145 11148 14 14 - GAT
#chr1 11160 11163 14 14 - GAT
#chr1 11576 11579 14 14 - GAT
#chr1 11597 11600 14 14 - GAT
Find closest 2nd GAT to 1st GAT
find the + or - 2nd GAT to the closest +GAT
find the + or - 2nd GAT to the closest -GAT
#!/usr/bin/env Rscript
Args=commandArgs(TRUE)
bedTools.closest <- function(functionstring="/home/FCAM/ssun/packages/bedtools2/bin/closestBed",bed1,bed2,opt.string="") {
options(scipen =99) # not use scientific notation when writing out
#write bed formatted data.frames to tempfile
write.table(bed1,file= 'a.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(bed2,file= 'b.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
# create the command string and call the command using system()
# the command sort a and b file by coordinates
command1=paste('sort -k1,1 -k2,2n', 'a.file.bed', '> a.file.sorted.bed')
cat(command1,"\n") #sort -k1,1 -k2,2n a.file.bed > a.file.sorted.bed
try(system(command1))
command2=paste('sort -k1,1 -k2,2n', 'b.file.bed', '> b.file.sorted.bed')
cat(command2,"\n")
try(system(command2))
# the command call closestBed on bed1 and bed2
command=paste(functionstring, opt.string,"-a",'a.file.sorted.bed',"-b",'b.file.sorted.bed',">",'out.file.bed',sep=" ")
cat(command,"\n")
try(system(command))
res=read.table('out.file.bed',header=F, comment.char='')
# remove intermediate files
command3=paste('rm', 'a.file.bed', 'b.file.bed', 'a.file.sorted.bed', 'b.file.sorted.bed', 'out.file.bed')
cat(command3,"\n")
try(system(command3))
colnames(res) = c(colnames(bed1), colnames(bed2), 'dis' )
return(res)
}
library(bigWig)
# GAT coordinates without overlapped closest GAT -- relative to G for plus, relative to C for minus
plus.GAT.file=fiveprime.bed(read.table(file = "hg38.3.3.3plus.36_GAT_without_1st_plus_GAT_to_GATA3_with_motif1.bed", sep="\t", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
minus.GAT.file=fiveprime.bed(read.table(file = "hg38.3.3.3minus.14_GAT_without_1st_minus_GAT_to_GATA3_with_motif1.bed", sep="\t", header=FALSE), upstreamWindow = -1, downstreamWindow = 1)
head(plus.GAT.file) # find coordinates at Genome Browser to make sure the anchor point is correct -- anchor at G on GAT
# V1 V2 V3 V4 V5 V6 V7
#1 chr1 11159 11160 36 36 + GAT
#2 chr1 11462 11463 36 36 + GAT
#3 chr1 11521 11522 36 36 + GAT
#4 chr1 11569 11570 36 36 + GAT
#5 chr1 11651 11652 36 36 + GAT
#6 chr1 11684 11685 36 36 + GAT
nrow(plus.GAT.file)
#[1] 39078275
head(minus.GAT.file) # find coordinates at Genome Browser to make sure the anchor point is correct -- anchor at C on ATC
# V1 V2 V3 V4 V5 V6 V7
#1 chr1 10547 10548 14 14 - GAT
#2 chr1 11147 11148 14 14 - GAT
#3 chr1 11162 11163 14 14 - GAT
#4 chr1 11578 11579 14 14 - GAT
#5 chr1 11599 11600 14 14 - GAT
#6 chr1 11848 11849 14 14 - GAT
nrow(minus.GAT.file)
#[1] 38705132
# anchor position: closest +/- GAT relative to G
closest_plus_GAT_to_GATA3_peak_summits_motif1=fiveprime.bed(read.table("closest.1st.plus.GAT.to.GATA3.with.motif1.uniq.sorted.bed", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
closest_minus_GAT_to_GATA3_peak_summits_motif1=fiveprime.bed(read.table("closest.1st.minus.GAT.to.GATA3.with.motif1.uniq.sorted.bed", header=FALSE), upstreamWindow = -1, downstreamWindow = 1)
nrow(closest_plus_GAT_to_GATA3_peak_summits_motif1)
#[1] 12461
head(closest_plus_GAT_to_GATA3_peak_summits_motif1) # find coordinates at Genome Browser to make sure the anchor point is correct -- anchor at G on GAT
# V1 V2 V3 V4 V5 V6 V7
#1 chr1 1080566 1080567 36 36 + GAT
#2 chr1 1238219 1238220 36 36 + GAT
#3 chr1 1686429 1686430 36 36 + GAT
#4 chr1 1741128 1741129 36 36 + GAT
#5 chr1 2002016 2002017 36 36 + GAT
#6 chr1 3127380 3127381 36 36 + GAT
nrow(closest_minus_GAT_to_GATA3_peak_summits_motif1)
#[1] 12460
head(closest_minus_GAT_to_GATA3_peak_summits_motif1) # find coordinates at Genome Browser to make sure the anchor point is correct -- anchor at C on ATC
# V1 V2 V3 V4 V5 V6 V7
#1 chr1 1080574 1080575 14 14 - GAT
#2 chr1 1238227 1238228 14 14 - GAT
#3 chr1 1686413 1686414 14 14 - GAT
#4 chr1 1741178 1741179 14 14 - GAT
#5 chr1 2002024 2002025 14 14 - GAT
#6 chr1 3127369 3127370 14 14 - GAT
# 2nd closest +/-GAT to closest plus GAT
closest.2nd.plus.GAT.to.1st.plus.GAT.GATA3.motif1=bedTools.closest(bed1 = closest_plus_GAT_to_GATA3_peak_summits_motif1[,1:3], bed2 = plus.GAT.file, opt.string = '-d -t first')
closest.2nd.minus.GAT.to.1st.plus.GAT.GATA3.motif1=bedTools.closest(bed1 = closest_minus_GAT_to_GATA3_peak_summits_motif1[,1:3], bed2 = plus.GAT.file, opt.string = '-d -t first')
write.table(closest.2nd.plus.GAT.to.1st.plus.GAT.GATA3.motif1,file= 'closest.2nd.plus.GAT.to.1st.plus.GAT.GATA3.motif1.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.2nd.minus.GAT.to.1st.plus.GAT.GATA3.motif1,file= 'closest.2nd.minus.GAT.to.1st.plus.GAT.GATA3.motif1.bed', quote=F,sep="\t",col.names=F,row.names=F)
# 2nd closest +/- GAT to closest minus GAT
closest.2nd.plus.GAT.to.1st.minus.GAT.GATA3.motif1=bedTools.closest(bed1 = closest_plus_GAT_to_GATA3_peak_summits_motif1[,1:3], bed2 = minus.GAT.file, opt.string = '-d -t first')
closest.2nd.minus.GAT.to.1st.minus.GAT.GATA3.motif1=bedTools.closest(bed1 = closest_minus_GAT_to_GATA3_peak_summits_motif1[,1:3], bed2 = minus.GAT.file, opt.string = '-d -t first')
write.table(closest.2nd.plus.GAT.to.1st.minus.GAT.GATA3.motif1,file= 'closest.2nd.plus.GAT.to.1st.minus.GAT.GATA3.motif1.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.2nd.minus.GAT.to.1st.minus.GAT.GATA3.motif1,file= 'closest.2nd.minus.GAT.to.1st.minus.GAT.GATA3.motif1.bed', quote=F,sep="\t",col.names=F,row.names=F)
save.image('240129_GATA3_ChIP_2nd_clsestbed.Rdata')
df.plot = data.frame(matrix(nrow = 0, ncol = 3))
colnames(df.plot) = c("dis", "anchor_status", "query_status")
for (closest_2nd_dis in Sys.glob(file.path("./closest.2nd*GAT.to.1st*GAT.GATA3.motif1.bed"))) {
print(closest_2nd_dis)
anchor_status =strsplit((strsplit(strsplit(closest_2nd_dis, "/")[[1]][length(strsplit(closest_2nd_dis, "/")[[1]])], 'closest.2nd.*.to.1st.')[[1]][2]), ".GATA3.motif1.bed")[[1]][1]
print(anchor_status)
query_status =strsplit((strsplit(strsplit(closest_2nd_dis, "/")[[1]][length(strsplit(closest_2nd_dis, "/")[[1]])], 'closest.2nd.')[[1]][2]), ".to.1st.*.GATA3.motif1.bed")[[1]][1]
print(query_status)
temp = cbind(read.table(closest_2nd_dis, header=F, comment.char='')[,11], anchor_status, query_status)
df.plot = rbind(df.plot,temp)
}
str(df.plot)
#'data.frame': 49842 obs. of 3 variables:
# $ V1 : chr "240" "3" "117" "114" ...
# $ anchor_status: chr "minus.GAT" "minus.GAT" "minus.GAT" "minus.GAT" ...
# $ query_status : chr "minus.GAT" "minus.GAT" "minus.GAT" "minus.GAT" ...
unique(df.plot$anchor_status)
#[1] "minus.GAT" "plus.GAT"
unique(df.plot$query_status)
#[1] "minus.GAT" "plus.GAT"
colnames(df.plot) = c("dis", "anchor_status", "query_status")
head(df.plot)
# dis anchor_status query_status
#1 240 minus.GAT minus.GAT
#2 3 minus.GAT minus.GAT
#3 117 minus.GAT minus.GAT
#4 114 minus.GAT minus.GAT
#5 27 minus.GAT minus.GAT
#6 3 minus.GAT minus.GAT
nrow(df.plot)
#[1] 49842
df.plot$anchor_status = factor(df.plot$anchor_status, levels = c("plus.GAT", "minus.GAT"))
df.plot$query_status = factor(df.plot$query_status, levels = c("plus.GAT", "minus.GAT"))
df.plot$dis=as.numeric(df.plot$dis)
library(lattice)
library(latticeExtra)
# PDF plot
pdf('PDF_closest_2nd_GAT_to_closest_1st_GAT_to_GATA3_pos_motif1.pdf', width=15,height=5)
print(densityplot( ~ abs(dis) | anchor_status,
groups = query_status,
data = df.plot,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c("red","blue"),
aspect = 1,
xlim=c(0,50),
from=0,
to=50,
#layout=c(1,2),
xlab = "distance (bp) from 2nd closest GAT to closest GAT",
ylab="probability density function",
main="GATA3 peak with motif1",
between=list(y=1.0),
scales = list(x = list(rot = 45, at = seq(from = 0, to = 50, by = 5))),
type = "count",
lty = c(1),
lwd=2,
par.settings = list(superpose.line = list(col=c("red", "blue"), lwd=3), strip.background=list(col="grey85")),
panel = function(x, ...) {
panel.densityplot(x, ...)
#panel.xyplot(x, y, col = "skyblue", pch = 16)
panel.ablineq(v = 8, col = "red", lty = 2, adj=c(0,1), label="8")}
))
dev.off()
png('PDF_closest_2nd_GAT_to_closest_1st_GAT_to_GATA3_pos_motif1.png')
print(densityplot( ~ abs(dis) | anchor_status,
groups = query_status,
data = df.plot,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c("red","blue"),
aspect = 1,
xlim=c(0,50),
from=0,
to=50,
#layout=c(1,2),
xlab = "distance (bp) from 2nd closest GAT to closest GAT",
ylab="probability density function",
main="GATA3 peak with motif1",
between=list(y=1.0),
scales = list(x = list(rot = 45, at = seq(from = 0, to = 50, by = 5))),
type = "count",
lty = c(1),
lwd=2,
par.settings = list(superpose.line = list(col=c("red", "blue"), lwd=3), strip.background=list(col="grey85")),
panel = function(x, ...) {
panel.densityplot(x, ...)
panel.ablineq(v = 8, col = "red", lty = 2, adj=c(0,1), label="8")}
))
dev.off()
Figure 14: PDF plots of GATA3 peak with motif1: 3mer structure
adds individual datapoint on the density plot
df.plot = data.frame(matrix(nrow = 0, ncol = 5))
colnames(df.plot) = c("dis","anchor_status", "query_status","abs.dis", "density")
for (closest_2nd_dis in Sys.glob(file.path("./closest.2nd*GAT.to.1st*GAT.GATA3.motif1.bed"))) {
print(closest_2nd_dis)
anchor_status =strsplit((strsplit(strsplit(closest_2nd_dis, "/")[[1]][length(strsplit(closest_2nd_dis, "/")[[1]])], 'closest.2nd.*.to.1st.')[[1]][2]), ".GATA3.motif1.bed")[[1]][1]
print(anchor_status)
query_status =strsplit((strsplit(strsplit(closest_2nd_dis, "/")[[1]][length(strsplit(closest_2nd_dis, "/")[[1]])], 'closest.2nd.')[[1]][2]), ".to.1st.*.GATA3.motif1.bed")[[1]][1]
print(query_status)
temp = as.data.frame(cbind(read.table(closest_2nd_dis, header=F, comment.char='')[,11], anchor_status, query_status))
colnames(temp) = c("dis", "anchor_status", "query_status")
temp$dis=as.numeric(temp$dis)
temp$abs.dis=abs(temp$dis)
density_values = density(temp$abs.dis, from=0, to=50, bw="nrd0")
temp$density =approx(density_values$x, density_values$y, temp$abs.dis)$y
df.plot = rbind(df.plot,temp)
}
str(df.plot)
#'data.frame': 49842 obs. of 5 variables:
# $ dis : int 240 3 117 114 27 3 5 11 67 39 ...
# $ anchor_status: chr "minus.GAT" "minus.GAT" "minus.GAT" "minus.GAT" ...
# $ query_status : chr "minus.GAT" "minus.GAT" "minus.GAT" "minus.GAT" ...
# $ abs.dis : int 240 3 117 114 27 3 5 11 67 39 ...
# $ density : num NA 0.0165 NA NA 0.0138 ...
unique(df.plot$anchor_status)
#[1] "minus.GAT" "plus.GAT"
unique(df.plot$query_status)
#[1] "minus.GAT" "plus.GAT"
head(df.plot)
# dis anchor_status query_status abs.dis density
#1 240 minus.GAT minus.GAT 240 NA
#2 3 minus.GAT minus.GAT 3 0.01653378
#3 117 minus.GAT minus.GAT 117 NA
#4 114 minus.GAT minus.GAT 114 NA
#5 27 minus.GAT minus.GAT 27 0.01379179
#6 3 minus.GAT minus.GAT 3 0.01653378
nrow(df.plot)
#[1] 49842
df.plot$anchor_status = factor(df.plot$anchor_status, levels = c("plus.GAT", "minus.GAT"))
df.plot$query_status = factor(df.plot$query_status, levels = c("plus.GAT", "minus.GAT"))
library(lattice)
library(latticeExtra)
pdf('PDF2_closest_2nd_GAT_to_closest_1st_GAT_to_GATA3_pos_motif1.pdf', width=15,height=5)
print(xyplot(density ~ abs.dis | anchor_status,
data = df.plot,
groups = query_status,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
aspect = 1,
xlim=c(0,50),
#type = c('p', 'smooth'),
xlab = "distance (bp) from 2nd closest GAT to closest GAT",
ylab="probability density function",
main="GATA3 peak with motif1",
between=list(y=1.0),
scales = list(x = list(rot = 45, at = seq(from = 0, to = 50, by = 5))),
par.settings = list(superpose.line = list(col=c("pink", "skyblue"), lwd=2), strip.background=list(col="grey85")),
panel = function(x, y, ...) {panel.densityplot(x, data = df.plot,
from=0,
to=50,
lty = c(1),
lwd=2,
type = "count",
col=c("pink","skyblue"), ...)
panel.xyplot(x, y,
col=c("red","blue"),
pch=18,
cex=0.6,...)
panel.abline(v=c(seq(1, 10, 1)), col = "grey90")
}))
dev.off()
df.plot = data.frame(matrix(nrow = 0, ncol = 3))
colnames(df.plot) = c("dis", "anchor_status", "query_status")
for (closest_2nd_dis in Sys.glob(file.path("./closest.2nd*GAT.to.1st*GAT.GATA3.motif1.bed"))) {
print(closest_2nd_dis)
anchor_status =strsplit((strsplit(strsplit(closest_2nd_dis, "/")[[1]][length(strsplit(closest_2nd_dis, "/")[[1]])], 'closest.2nd.*.to.1st.')[[1]][2]), ".GATA3.motif1.bed")[[1]][1]
print(anchor_status)
query_status =strsplit((strsplit(strsplit(closest_2nd_dis, "/")[[1]][length(strsplit(closest_2nd_dis, "/")[[1]])], 'closest.2nd.')[[1]][2]), ".to.1st.*.GATA3.motif1.bed")[[1]][1]
print(query_status)
temp = cbind(read.table(closest_2nd_dis, header=F, comment.char='')[,11], anchor_status, query_status)
df.plot = rbind(df.plot,temp)
}
str(df.plot)
#'data.frame': 49842 obs. of 3 variables:
# $ V1 : chr "240" "3" "117" "114" ...
# $ anchor_status: chr "minus.GAT" "minus.GAT" "minus.GAT" "minus.GAT" ...
# $ query_status : chr "minus.GAT" "minus.GAT" "minus.GAT" "minus.GAT" ...
unique(df.plot$anchor_status)
#[1] "minus.GAT" "plus.GAT"
unique(df.plot$query_status)
#[1] "minus.GAT" "plus.GAT"
colnames(df.plot) = c("dis", "anchor_status", "query_status")
head(df.plot)
# dis anchor_status query_status
#1 240 minus.GAT minus.GAT
#2 3 minus.GAT minus.GAT
#3 117 minus.GAT minus.GAT
#4 114 minus.GAT minus.GAT
#5 27 minus.GAT minus.GAT
#6 3 minus.GAT minus.GAT
nrow(df.plot)
#[1] 49842
df.plot$anchor_status = factor(df.plot$anchor_status, levels = c("plus.GAT", "minus.GAT"))
df.plot$query_status = factor(df.plot$query_status, levels = c("plus.GAT", "minus.GAT"))
df.plot$dis=as.integer(df.plot$dis)
library(lattice)
library(latticeExtra)
pdf('his_closest_2nd_GAT_to_closest_1st_GAT_to_GATA3_pos_motif1.pdf', width=15,height=5)
print(histogram( ~ abs(dis) | anchor_status,
groups = query_status,
auto.key = list(space = "right", main="2nd GAT",rectangles=FALSE, lines=TRUE, points=FALSE, cex = 1),
#key = list(space = "right", text=list(c("+ strand", "- strand"), cex=1)),
data = df.plot,
aspect = 1,
breaks=seq(from=0,to=24921, by=1),
#breaks=1,
#nint=1,
type = "density",
xlim=c(0,50),
ylim=c(0, 0.3),
scales=list(relation='free'),
between=list(y=1.0),
layout = c(2, 1),
lwd=2,
xlab = "distance (bp) from 2nd closest GAT to closest GAT",
ylab = "Density",
main="GATA3 peak with motif1",
par.settings = list(superpose.line = list(col=c("red", "blue"), lwd=3), strip.background=list(col="grey85")),
panel = function(x,...) {#panel.histogram(...)
#panel.densityplot(...)
panel.densityplot(x, data = df.plot,
from=0,
to=50,
lty = c(1),
darg=list(bw = "nrd0", kernel="gaussian"),
#lwd=2,
#type = "count",
col=c("red","blue"), ...)
panel.superpose(x, ..., panel.groups=panel.histogram,
col=c('red', 'blue'), alpha=0.4)
}))
dev.off()
png('his_closest_2nd_GAT_to_closest_1st_GAT_to_GATA3_pos_motif1.png')
print(histogram( ~ abs(dis) | anchor_status,
groups = query_status,
auto.key = list(space = "right", main="2nd GAT",rectangles=FALSE, lines=TRUE, points=FALSE, cex = 1),
#key = list(space = "right", text=list(c("+ strand", "- strand"), cex=1)),
data = df.plot,
aspect = 1,
breaks=seq(from=0,to=24921, by=1),
#breaks=1,
#nint=1,
type = "density",
xlim=c(0,50),
ylim=c(0, 0.3),
scales=list(relation='free'),
between=list(y=1.0),
layout = c(2, 1),
lwd=2,
xlab = "distance (bp) from 2nd closest GAT to closest GAT",
ylab = "Density",
main="GATA3 peak with motif1",
par.settings = list(superpose.line = list(col=c("red", "blue"), lwd=3), strip.background=list(col="grey85")),
panel = function(x,...) {#panel.histogram(...)
#panel.densityplot(...)
panel.densityplot(x, data = df.plot,
from=0,
to=50,
lty = c(1),
darg=list(bw = "nrd0", kernel="gaussian"),
#lwd=2,
#type = "count",
col=c("red","blue"), ...)
panel.superpose(x, ..., panel.groups=panel.histogram,
col=c('red', 'blue'), alpha=0.4)
}))
dev.off()
Figure 15: histogram of GATA3 peak with motif1: 3mer structure
Confirm if the y value from lattice “type==density” is the relative frequency of each distance occurrence in data.
calculate_relative_frequency <- function(data) {
# Use table() to create a frequency table
relative_frequencies <- table(data)/length(data)
result <- data.frame(value = names(relative_frequencies), rel_freq = as.vector(relative_frequencies))
return(result)
}
df.plot = data.frame(matrix(nrow = 0, ncol = 5))
colnames(df.plot) = c("dis","anchor_status", "query_status","abs.dis", "rel_freq")
for (closest_2nd_dis in Sys.glob(file.path("./closest.2nd*GAT.to.1st*GAT.GATA3.motif1.bed"))) {
print(closest_2nd_dis)
anchor_status =strsplit((strsplit(strsplit(closest_2nd_dis, "/")[[1]][length(strsplit(closest_2nd_dis, "/")[[1]])], 'closest.2nd.*.to.1st.')[[1]][2]), ".GATA3.motif1.bed")[[1]][1]
print(anchor_status)
query_status =strsplit((strsplit(strsplit(closest_2nd_dis, "/")[[1]][length(strsplit(closest_2nd_dis, "/")[[1]])], 'closest.2nd.')[[1]][2]), ".to.1st.*.GATA3.motif1.bed")[[1]][1]
print(query_status)
temp = as.data.frame(cbind(read.table(closest_2nd_dis, header=F, comment.char='')[,11], anchor_status, query_status))
colnames(temp) = c("dis", "anchor_status", "query_status")
temp$dis=as.integer(temp$dis)
temp$abs.dis=abs(temp$dis)
relative_frequency = calculate_relative_frequency(temp$abs.dis)
temp1= merge(temp, relative_frequency, by.x = "abs.dis", by.y = "value", all.x = TRUE)
df.plot = rbind(df.plot,temp1)
}
str(df.plot)
#'data.frame': 49842 obs. of 5 variables:
# $ abs.dis : int 3 3 3 3 3 3 3 3 3 3 ...
# $ dis : int 3 3 3 3 3 3 3 3 3 3 ...
# $ anchor_status: chr "minus.GAT" "minus.GAT" "minus.GAT" "minus.GAT" ...
# $ query_status : chr "minus.GAT" "minus.GAT" "minus.GAT" "minus.GAT" ...
# $ rel_freq : num 0.0272 0.0272 0.0272 0.0272 0.0272 ...
unique(df.plot$anchor_status)
#[1] "minus.GAT" "plus.GAT"
unique(df.plot$query_status)
#[1] "minus.GAT" "plus.GAT"
head(df.plot)
# abs.dis dis anchor_status query_status rel_freq
#1 3 3 minus.GAT minus.GAT 0.02720706
#2 3 3 minus.GAT minus.GAT 0.02720706
#3 3 3 minus.GAT minus.GAT 0.02720706
#4 3 3 minus.GAT minus.GAT 0.02720706
#5 3 3 minus.GAT minus.GAT 0.02720706
#6 3 3 minus.GAT minus.GAT 0.02720706
df.plot$anchor_status = factor(df.plot$anchor_status, levels = c("plus.GAT", "minus.GAT"))
df.plot$query_status = factor(df.plot$query_status, levels = c("plus.GAT", "minus.GAT"))
nrow(df.plot)
#[1] 49842
nrow(df.plot[!duplicated(df.plot), ])
#1104
summary(df.plot[!duplicated(df.plot), ])
abs.dis dis anchor_status query_status
Min. : 1.00 Min. : 1.00 plus.GAT :537 plus.GAT :550
1st Qu.: 70.75 1st Qu.: 70.75 minus.GAT:567 minus.GAT:554
Median : 139.50 Median : 139.50
Mean : 192.12 Mean : 192.12
3rd Qu.: 212.00 3rd Qu.: 212.00
Max. :1590.00 Max. :1590.00
rel_freq
Min. :8.025e-05
1st Qu.:1.605e-04
Median :6.421e-04
Mean :3.623e-03
3rd Qu.:3.772e-03
Max. :1.623e-01
library(lattice)
library(latticeExtra)
pdf('xy_closest_2nd_GAT_to_closest_1st_GAT_to_GATA3_pos_motif1.pdf', width=15,height=5)
print(xyplot(rel_freq ~ abs.dis | anchor_status,
#data = df.plot[!duplicated(df.plot), ],
data = df.plot,
groups = query_status,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex=1),
aspect = 1,
xlim=c(0,50),
ylim=c(0, 0.3),
#type = c('p', 'smooth'),
xlab = "distance (bp) from 2nd closest GAT to closest GAT",
ylab="Frequency",
main="GATA3 peak with motif1",
between=list(y=1.0),
scales = list(x = list(rot = 45, relation='free', at = seq(from = 0, to = 50, by = 5))),
par.settings = list(superpose.line = list(col=c("pink", "skyblue"), lwd=2), strip.background=list(col="grey85")),
panel = function(x,y,...) {panel.densityplot(x, data = df.plot,
from=0,
to=50,
lty = c(1),
lwd=2,
darg=list(bw = "nrd0", kernel="gaussian"),
type = "count",
col=c("pink","skyblue"), ...)
panel.xyplot(x, y,
col=c("red","blue"),
pch=18,
cex=0.6,...)
panel.abline(v=c(seq(1, 10, 1)), col = "grey90")
})
)
dev.off()
png('xy_closest_2nd_GAT_to_closest_1st_GAT_to_GATA3_pos_motif1.png')
print(xyplot(rel_freq ~ abs.dis | anchor_status,
#data = df.plot[!duplicated(df.plot), ],
data = df.plot,
groups = query_status,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex=1),
aspect = 1,
xlim=c(0,50),
ylim=c(0, 0.3),
#type = c('p', 'smooth'),
xlab = "distance (bp) from 2nd closest GAT to closest GAT",
ylab="Frequency",
main="GATA3 peak with motif1",
between=list(y=1.0),
scales = list(x = list(rot = 45, relation='free', at = seq(from = 0, to = 50, by = 5))),
par.settings = list(superpose.line = list(col=c("pink", "skyblue"), lwd=2), strip.background=list(col="grey85")),
panel = function(x,y,...) {panel.densityplot(x, data = df.plot,
from=0,
to=50,
lty = c(1),
lwd=2,
darg=list(bw = "nrd0", kernel="gaussian"),
type = "count",
col=c("pink","skyblue"), ...)
panel.xyplot(x, y,
col=c("red","blue"),
pch=18,
cex=0.6,...)
panel.abline(v=c(seq(1, 10, 1)), col = "grey90")
})
)
dev.off()
Figure 16: xy plot of GATA3 peak with motif1: 3mer structure
#dir=/labs/Guertin/siyu/Sathyan_GATA3_ChIP_pool1_pool2/overrep_3mer/hg38_full_kmer3_rs1000/seqdump/
#sort -k1,1 -k2,2n ${dir}hg38.3.3.3plus.36_GAT.bed > hg38.3.3.3plus.36_GAT.sorted.bed
#sort -k1,1 -k2,2n ${dir}hg38.3.3.3minus.14_GAT.bed > hg38.3.3.3minus.14_GAT.sorted.bed
module load bedtools
my_motifs=("motif4" "motif2" "motif6" "motif5")
# Use a for loop to iterate over the strings
for motif in "${my_motifs[@]}"
do
echo $motif
awk '{print $4, $5, $6, $7, $8, $9, $10}' closest.1st.plus.GAT.to.GATA3.with.${motif}.bed | awk '{$1=$1}1' OFS="\t" | uniq | sort -k1,1 -k2,2n > closest.1st.plus.GAT.to.GATA3.with.${motif}.uniq.sorted.bed
bedtools subtract -a hg38.3.3.3plus.36_GAT.sorted.bed -b closest.1st.plus.GAT.to.GATA3.with.${motif}.uniq.sorted.bed -f 1.00 -s > hg38.3.3.3plus.36_GAT_without_1st_plus_GAT_to_GATA3_with_${motif}.bed
wc -l hg38.3.3.3plus.36_GAT.sorted.bed
wc -l closest.1st.plus.GAT.to.GATA3.with.${motif}.uniq.sorted.bed
wc -l hg38.3.3.3plus.36_GAT_without_1st_plus_GAT_to_GATA3_with_${motif}.bed
head -5 hg38.3.3.3plus.36_GAT_without_1st_plus_GAT_to_GATA3_with_${motif}.bed
awk '{print $4, $5, $6, $7, $8, $9, $10}' closest.1st.minus.GAT.to.GATA3.with.${motif}.bed | awk '{$1=$1}1' OFS="\t" | uniq | sort -k1,1 -k2,2n > closest.1st.minus.GAT.to.GATA3.with.${motif}.uniq.sorted.bed
bedtools subtract -a hg38.3.3.3minus.14_GAT.sorted.bed -b closest.1st.minus.GAT.to.GATA3.with.${motif}.uniq.sorted.bed -f 1.00 -s > hg38.3.3.3minus.14_GAT_without_1st_minus_GAT_to_GATA3_with_${motif}.bed
wc -l hg38.3.3.3minus.14_GAT.sorted.bed
wc -l closest.1st.minus.GAT.to.GATA3.with.motif1.uniq.sorted.bed
wc -l hg38.3.3.3minus.14_GAT_without_1st_minus_GAT_to_GATA3_with_motif1.bed
head -5 hg38.3.3.3minus.14_GAT_without_1st_minus_GAT_to_GATA3_with_motif1.bed
done 2>&1 | tee -a 240129_bedtools_subtract.txt
#cat 240129_bedtools_subtract.txt
motif4
39090736 hg38.3.3.3plus.36_GAT.sorted.bed
6487 closest.1st.plus.GAT.to.GATA3.with.motif4.uniq.sorted.bed
39084249 hg38.3.3.3plus.36_GAT_without_1st_plus_GAT_to_GATA3_with_motif4.bed
#39090736-6487
#[1] 39084249
chr1 11159 11162 36 36 + GAT
chr1 11462 11465 36 36 + GAT
chr1 11521 11524 36 36 + GAT
chr1 11569 11572 36 36 + GAT
chr1 11651 11654 36 36 + GAT
38717592 hg38.3.3.3minus.14_GAT.sorted.bed
12460 closest.1st.minus.GAT.to.GATA3.with.motif1.uniq.sorted.bed
38705132 hg38.3.3.3minus.14_GAT_without_1st_minus_GAT_to_GATA3_with_motif1.bed
#38717592-12460
#[1] 38705132
chr1 10545 10548 14 14 - GAT
chr1 11145 11148 14 14 - GAT
chr1 11160 11163 14 14 - GAT
chr1 11576 11579 14 14 - GAT
chr1 11597 11600 14 14 - GAT
motif2
39090736 hg38.3.3.3plus.36_GAT.sorted.bed
11464 closest.1st.plus.GAT.to.GATA3.with.motif2.uniq.sorted.bed
39079272 hg38.3.3.3plus.36_GAT_without_1st_plus_GAT_to_GATA3_with_motif2.bed
chr1 11159 11162 36 36 + GAT
chr1 11462 11465 36 36 + GAT
chr1 11521 11524 36 36 + GAT
chr1 11569 11572 36 36 + GAT
chr1 11651 11654 36 36 + GAT
38717592 hg38.3.3.3minus.14_GAT.sorted.bed
12460 closest.1st.minus.GAT.to.GATA3.with.motif1.uniq.sorted.bed
38705132 hg38.3.3.3minus.14_GAT_without_1st_minus_GAT_to_GATA3_with_motif1.bed
chr1 10545 10548 14 14 - GAT
chr1 11145 11148 14 14 - GAT
chr1 11160 11163 14 14 - GAT
chr1 11576 11579 14 14 - GAT
chr1 11597 11600 14 14 - GAT
motif6
39090736 hg38.3.3.3plus.36_GAT.sorted.bed
5360 closest.1st.plus.GAT.to.GATA3.with.motif6.uniq.sorted.bed
39085376 hg38.3.3.3plus.36_GAT_without_1st_plus_GAT_to_GATA3_with_motif6.bed
chr1 11159 11162 36 36 + GAT
chr1 11462 11465 36 36 + GAT
chr1 11521 11524 36 36 + GAT
chr1 11569 11572 36 36 + GAT
chr1 11651 11654 36 36 + GAT
38717592 hg38.3.3.3minus.14_GAT.sorted.bed
12460 closest.1st.minus.GAT.to.GATA3.with.motif1.uniq.sorted.bed
38705132 hg38.3.3.3minus.14_GAT_without_1st_minus_GAT_to_GATA3_with_motif1.bed
chr1 10545 10548 14 14 - GAT
chr1 11145 11148 14 14 - GAT
chr1 11160 11163 14 14 - GAT
chr1 11576 11579 14 14 - GAT
chr1 11597 11600 14 14 - GAT
motif5
39090736 hg38.3.3.3plus.36_GAT.sorted.bed
4166 closest.1st.plus.GAT.to.GATA3.with.motif5.uniq.sorted.bed
39086570 hg38.3.3.3plus.36_GAT_without_1st_plus_GAT_to_GATA3_with_motif5.bed
chr1 11159 11162 36 36 + GAT
chr1 11462 11465 36 36 + GAT
chr1 11521 11524 36 36 + GAT
chr1 11569 11572 36 36 + GAT
chr1 11651 11654 36 36 + GAT
38717592 hg38.3.3.3minus.14_GAT.sorted.bed
12460 closest.1st.minus.GAT.to.GATA3.with.motif1.uniq.sorted.bed
38705132 hg38.3.3.3minus.14_GAT_without_1st_minus_GAT_to_GATA3_with_motif1.bed
chr1 10545 10548 14 14 - GAT
chr1 11145 11148 14 14 - GAT
chr1 11160 11163 14 14 - GAT
chr1 11576 11579 14 14 - GAT
chr1 11597 11600 14 14 - GAT
#!/usr/bin/env Rscript
Args=commandArgs(TRUE)
bedTools.closest <- function(functionstring="/home/FCAM/ssun/packages/bedtools2/bin/closestBed",bed1,bed2,opt.string="") {
options(scipen =99) # not use scientific notation when writing out
#write bed formatted data.frames to tempfile
write.table(bed1,file= 'a.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(bed2,file= 'b.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
# create the command string and call the command using system()
# the command sort a and b file by coordinates
command1=paste('sort -k1,1 -k2,2n', 'a.file.bed', '> a.file.sorted.bed')
cat(command1,"\n") #sort -k1,1 -k2,2n a.file.bed > a.file.sorted.bed
try(system(command1))
command2=paste('sort -k1,1 -k2,2n', 'b.file.bed', '> b.file.sorted.bed')
cat(command2,"\n")
try(system(command2))
# the command call closestBed on bed1 and bed2
command=paste(functionstring, opt.string,"-a",'a.file.sorted.bed',"-b",'b.file.sorted.bed',">",'out.file.bed',sep=" ")
cat(command,"\n")
try(system(command))
res=read.table('out.file.bed',header=F, comment.char='')
# remove intermediate files
command3=paste('rm', 'a.file.bed', 'b.file.bed', 'a.file.sorted.bed', 'b.file.sorted.bed', 'out.file.bed')
cat(command3,"\n")
try(system(command3))
colnames(res) = c(colnames(bed1), colnames(bed2), 'dis' )
return(res)
}
library(bigWig)
my_motifs = c("motif4","motif2", "motif6", "motif5")
for (motif in my_motifs) {
print(motif)
# GAT coordinates without overlapped closest GAT -- relative to G for plus, relative to C for minus
plus.GAT.file=fiveprime.bed(read.table(file = paste0("hg38.3.3.3plus.36_GAT_without_1st_plus_GAT_to_GATA3_with_", motif, ".bed"), sep="\t", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
minus.GAT.file=fiveprime.bed(read.table(file =paste0("hg38.3.3.3minus.14_GAT_without_1st_minus_GAT_to_GATA3_with_", motif, ".bed"), sep="\t", header=FALSE), upstreamWindow = -1, downstreamWindow = 1)
head(plus.GAT.file) # find coordinates at Genome Browser to make sure the anchor point is correct -- anchor at G on GAT
nrow(plus.GAT.file)
head(minus.GAT.file) # find coordinates at Genome Browser to make sure the anchor point is correct -- anchor at C on ATC
nrow(minus.GAT.file)
# anchor position: closest +/- GAT relative to G
closest_plus_GAT_to_GATA3_peak_summits_=fiveprime.bed(read.table(paste0("closest.1st.plus.GAT.to.GATA3.with.", motif, ".uniq.sorted.bed"), header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
closest_minus_GAT_to_GATA3_peak_summits=fiveprime.bed(read.table(paste0("closest.1st.minus.GAT.to.GATA3.with.", motif, ".uniq.sorted.bed"), header=FALSE), upstreamWindow = -1, downstreamWindow = 1)
nrow(closest_plus_GAT_to_GATA3_peak_summits)
head(closest_plus_GAT_to_GATA3_peak_summits) # find coordinates at Genome Browser to make sure the anchor point is correct -- anchor at G on GAT
nrow(closest_minus_GAT_to_GATA3_peak_summits)
head(closest_minus_GAT_to_GATA3_peak_summits) # find coordinates at Genome Browser to make sure the anchor point is correct -- anchor at C on ATC
# 2nd closest +/-GAT to closest plus GAT
closest.2nd.plus.GAT.to.1st.plus.GAT.GATA3.with.motif=bedTools.closest(bed1 = closest_plus_GAT_to_GATA3_peak_summits[,1:3], bed2 = plus.GAT.file, opt.string = '-d -t first')
closest.2nd.minus.GAT.to.1st.plus.GAT.GATA3.with.motif=bedTools.closest(bed1 = closest_minus_GAT_to_GATA3_peak_summits[,1:3], bed2 = plus.GAT.file, opt.string = '-d -t first')
write.table(closest.2nd.plus.GAT.to.1st.plus.GAT.GATA3.with.motif,file= paste0('closest.2nd.plus.GAT.to.1st.plus.GAT.GATA3.', motif, '.bed'), quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.2nd.minus.GAT.to.1st.plus.GAT.GATA3.with.motif,file= paste0('closest.2nd.minus.GAT.to.1st.plus.GAT.GATA3.', motif, '.bed'), quote=F,sep="\t",col.names=F,row.names=F)
# 2nd closest +/- GAT to closest minus GAT
closest.2nd.plus.GAT.to.1st.minus.GAT.GATA3.with.motif=bedTools.closest(bed1 = closest_plus_GAT_to_GATA3_peak_summits[,1:3], bed2 = minus.GAT.file, opt.string = '-d -t first')
closest.2nd.minus.GAT.to.1st.minus.GAT.GATA3.with.motif=bedTools.closest(bed1 = closest_minus_GAT_to_GATA3_peak_summits[,1:3], bed2 = minus.GAT.file, opt.string = '-d -t first')
write.table(closest.2nd.plus.GAT.to.1st.minus.GAT.GATA3.with.motif,file= paste0('closest.2nd.plus.GAT.to.1st.minus.GAT.GATA3.', motif, '.bed'), quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.2nd.minus.GAT.to.1st.minus.GAT.GATA3.with.motif,file= paste0('closest.2nd.minus.GAT.to.1st.minus.GAT.GATA3.', motif, '.bed'), quote=F,sep="\t",col.names=F,row.names=F)
}
save.image('240129_GATA3_ChIP_2nd_clsestbed2.Rdata')
motif4
df.plot = data.frame(matrix(nrow = 0, ncol = 3))
colnames(df.plot) = c("dis", "anchor_status", "query_status")
for (closest_2nd_dis in Sys.glob(file.path("./closest.2nd*GAT.to.1st*GAT.GATA3.motif4.bed"))) {
print(closest_2nd_dis)
anchor_status =strsplit((strsplit(strsplit(closest_2nd_dis, "/")[[1]][length(strsplit(closest_2nd_dis, "/")[[1]])], 'closest.2nd.*.to.1st.')[[1]][2]), ".GATA3.motif4.bed")[[1]][1]
print(anchor_status)
query_status =strsplit((strsplit(strsplit(closest_2nd_dis, "/")[[1]][length(strsplit(closest_2nd_dis, "/")[[1]])], 'closest.2nd.')[[1]][2]), ".to.1st.*.GATA3.motif4.bed")[[1]][1]
print(query_status)
temp = cbind(read.table(closest_2nd_dis, header=F, comment.char='')[,11], anchor_status, query_status)
df.plot = rbind(df.plot,temp)
}
str(df.plot)
unique(df.plot$anchor_status)
#[1] "minus.GAT" "plus.GAT"
unique(df.plot$query_status)
#[1] "minus.GAT" "plus.GAT"
colnames(df.plot) = c("dis", "anchor_status", "query_status")
head(df.plot)
#dis anchor_status query_status
#1 4 minus.GAT minus.GAT
#2 12 minus.GAT minus.GAT
#3 39 minus.GAT minus.GAT
#4 4 minus.GAT minus.GAT
#5 7 minus.GAT minus.GAT
#6 110 minus.GAT minus.GAT
df.plot$anchor_status = factor(df.plot$anchor_status, levels = c("plus.GAT", "minus.GAT"))
df.plot$query_status = factor(df.plot$query_status, levels = c("plus.GAT", "minus.GAT"))
df.plot$dis=as.numeric(df.plot$dis)
library(lattice)
library(latticeExtra)
# PDF plot
pdf('PDF_closest_2nd_GAT_to_closest_1st_GAT_to_GATA3_pos_motif4.pdf', width=15,height=5)
print(densityplot( ~ abs(dis) | anchor_status,
groups = query_status,
data = df.plot,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c("red","blue"),
aspect = 1,
xlim=c(0,50),
from=0,
to=50,
#layout=c(1,2),
xlab = "distance (bp) from 2nd closest GAT to closest GAT",
ylab="probability density function",
main="GATA3 peak with motif4",
between=list(y=1.0),
scales = list(x = list(rot = 45, at = seq(from = 0, to = 50, by = 5))),
type = "count",
lty = c(1),
lwd=2,
par.settings = list(superpose.line = list(col=c("red", "blue"), lwd=3), strip.background=list(col="grey85")),
panel = function(x, ...) {
panel.densityplot(x, ...)
panel.ablineq(v = 8, col = "red", lty = 2, adj=c(0,1), label="8")
}
))
dev.off()
png('PDF_closest_2nd_GAT_to_closest_1st_GAT_to_GATA3_pos_motif4.png')
print(densityplot( ~ abs(dis) | anchor_status,
groups = query_status,
data = df.plot,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c("red","blue"),
aspect = 1,
xlim=c(0,50),
from=0,
to=50,
#layout=c(1,2),
xlab = "distance (bp) from 2nd closest GAT to closest GAT",
ylab="probability density function",
main="GATA3 peak with motif4",
between=list(y=1.0),
scales = list(x = list(rot = 45, at = seq(from = 0, to = 50, by = 5))),
type = "count",
lty = c(1),
lwd=2,
par.settings = list(superpose.line = list(col=c("red", "blue"), lwd=3), strip.background=list(col="grey85")),
panel = function(x, ...) {
panel.densityplot(x, ...)
panel.ablineq(v = 8, col = "red", lty = 2, adj=c(0,1), label="8")}
))
dev.off()
motif2
df.plot = data.frame(matrix(nrow = 0, ncol = 3))
colnames(df.plot) = c("dis", "anchor_status", "query_status")
for (closest_2nd_dis in Sys.glob(file.path("./closest.2nd*GAT.to.1st*GAT.GATA3.motif2.bed"))) {
print(closest_2nd_dis)
anchor_status =strsplit((strsplit(strsplit(closest_2nd_dis, "/")[[1]][length(strsplit(closest_2nd_dis, "/")[[1]])], 'closest.2nd.*.to.1st.')[[1]][2]), ".GATA3.motif2.bed")[[1]][1]
print(anchor_status)
query_status =strsplit((strsplit(strsplit(closest_2nd_dis, "/")[[1]][length(strsplit(closest_2nd_dis, "/")[[1]])], 'closest.2nd.')[[1]][2]), ".to.1st.*.GATA3.motif2.bed")[[1]][1]
print(query_status)
temp = cbind(read.table(closest_2nd_dis, header=F, comment.char='')[,11], anchor_status, query_status)
df.plot = rbind(df.plot,temp)
}
str(df.plot)
unique(df.plot$anchor_status)
#[1] "minus.GAT" "plus.GAT"
unique(df.plot$query_status)
#[1] "minus.GAT" "plus.GAT"
colnames(df.plot) = c("dis", "anchor_status", "query_status")
head(df.plot)
# dis anchor_status query_status
#1 117 minus.GAT minus.GAT
#2 6 minus.GAT minus.GAT
#3 6 minus.GAT minus.GAT
#4 17 minus.GAT minus.GAT
#5 14 minus.GAT minus.GAT
#6 14 minus.GAT minus.GAT
df.plot$anchor_status = factor(df.plot$anchor_status, levels = c("plus.GAT", "minus.GAT"))
df.plot$query_status = factor(df.plot$query_status, levels = c("plus.GAT", "minus.GAT"))
df.plot$dis=as.numeric(df.plot$dis)
library(lattice)
library(latticeExtra)
# PDF plot
pdf('PDF_closest_2nd_GAT_to_closest_1st_GAT_to_GATA3_pos_motif2.pdf', width=15,height=5)
print(densityplot( ~ abs(dis) | anchor_status,
groups = query_status,
data = df.plot,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c("red","blue"),
aspect = 1,
xlim=c(0,50),
from=0,
to=50,
#layout=c(1,2),
xlab = "distance (bp) from 2nd closest GAT to closest GAT",
ylab="probability density function",
main="GATA3 peak with motif2",
between=list(y=1.0),
scales = list(x = list(rot = 45, at = seq(from = 0, to = 50, by = 5))),
type = "count",
lty = c(1),
lwd=2,
par.settings = list(superpose.line = list(col=c("red", "blue"), lwd=3), strip.background=list(col="grey85")),
panel = function(x, ...) {
panel.densityplot(x, ...)
panel.ablineq(v = 8, col = "red", lty = 2, adj=c(0,1), label="8")
panel.ablineq(v = 9, col = "red", lty = 2, adj=c(0,1), label="9")} # change to actual y relative to each distance x values; xyplot
))
dev.off()
png('PDF_closest_2nd_GAT_to_closest_1st_GAT_to_GATA3_pos_motif2.png')
print(densityplot( ~ abs(dis) | anchor_status,
groups = query_status,
data = df.plot,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c("red","blue"),
aspect = 1,
xlim=c(0,50),
from=0,
to=50,
#layout=c(1,2),
xlab = "distance (bp) from 2nd closest GAT to closest GAT",
ylab="probability density function",
main="GATA3 peak with motif2",
between=list(y=1.0),
scales = list(x = list(rot = 45, at = seq(from = 0, to = 50, by = 5))),
type = "count",
lty = c(1),
lwd=2,
par.settings = list(superpose.line = list(col=c("red", "blue"), lwd=3), strip.background=list(col="grey85")),
panel = function(x, ...) {
panel.densityplot(x, ...)
panel.ablineq(v = 8, col = "red", lty = 2, adj=c(0,1), label="8")
panel.ablineq(v = 9, col = "red", lty = 2, adj=c(0,1), label="9")}
))
dev.off()
motif6
df.plot = data.frame(matrix(nrow = 0, ncol = 3))
colnames(df.plot) = c("dis", "anchor_status", "query_status")
for (closest_2nd_dis in Sys.glob(file.path("./closest.2nd*GAT.to.1st*GAT.GATA3.motif6.bed"))) {
print(closest_2nd_dis)
anchor_status =strsplit((strsplit(strsplit(closest_2nd_dis, "/")[[1]][length(strsplit(closest_2nd_dis, "/")[[1]])], 'closest.2nd.*.to.1st.')[[1]][2]), ".GATA3.motif6.bed")[[1]][1]
print(anchor_status)
query_status =strsplit((strsplit(strsplit(closest_2nd_dis, "/")[[1]][length(strsplit(closest_2nd_dis, "/")[[1]])], 'closest.2nd.')[[1]][2]), ".to.1st.*.GATA3.motif6.bed")[[1]][1]
print(query_status)
temp = cbind(read.table(closest_2nd_dis, header=F, comment.char='')[,11], anchor_status, query_status)
df.plot = rbind(df.plot,temp)
}
str(df.plot)
unique(df.plot$anchor_status)
#[1] "minus.GAT" "plus.GAT"
unique(df.plot$query_status)
#[1] "minus.GAT" "plus.GAT"
colnames(df.plot) = c("dis", "anchor_status", "query_status")
head(df.plot)
# dis anchor_status query_status
#1 9 minus.GAT minus.GAT
#2 9 minus.GAT minus.GAT
#3 166 minus.GAT minus.GAT
#4 9 minus.GAT minus.GAT
#5 15 minus.GAT minus.GAT
#6 10 minus.GAT minus.GAT
df.plot$anchor_status = factor(df.plot$anchor_status, levels = c("plus.GAT", "minus.GAT"))
df.plot$query_status = factor(df.plot$query_status, levels = c("plus.GAT", "minus.GAT"))
df.plot$dis=as.numeric(df.plot$dis)
library(lattice)
library(latticeExtra)
# PDF plot
pdf('PDF_closest_2nd_GAT_to_closest_1st_GAT_to_GATA3_pos_motif6.pdf', width=15,height=5)
print(densityplot( ~ abs(dis) | anchor_status,
groups = query_status,
data = df.plot,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c("red","blue"),
aspect = 1,
xlim=c(0,50),
from=0,
to=50,
#layout=c(1,2),
xlab = "distance (bp) from 2nd closest GAT to closest GAT",
ylab="probability density function",
main="GATA3 peak with motif6",
between=list(y=1.0),
scales = list(x = list(rot = 45, at = seq(from = 0, to = 50, by = 5))),
type = "count",
lty = c(1),
lwd=2,
par.settings = list(superpose.line = list(col=c("red", "blue"), lwd=3), strip.background=list(col="grey85")),
panel = function(x, ...) {
panel.densityplot(x, ...)
panel.ablineq(v = 9, col = "red", lty = 2, adj=c(0,1), label="9")}
))
dev.off()
png('PDF_closest_2nd_GAT_to_closest_1st_GAT_to_GATA3_pos_motif6.png')
print(densityplot( ~ abs(dis) | anchor_status,
groups = query_status,
data = df.plot,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c("red","blue"),
aspect = 1,
xlim=c(0,50),
from=0,
to=50,
#layout=c(1,2),
xlab = "distance (bp) from 2nd closest GAT to closest GAT",
ylab="probability density function",
main="GATA3 peak with motif6",
between=list(y=1.0),
scales = list(x = list(rot = 45, at = seq(from = 0, to = 50, by = 5))),
type = "count",
lty = c(1),
lwd=2,
par.settings = list(superpose.line = list(col=c("red", "blue"), lwd=3), strip.background=list(col="grey85")),
panel = function(x, ...) {
panel.densityplot(x, ...)
panel.ablineq(v = 9, col = "red", lty = 2, adj=c(0,1), label="9")}
))
dev.off()
motif5
df.plot = data.frame(matrix(nrow = 0, ncol = 3))
colnames(df.plot) = c("dis", "anchor_status", "query_status")
for (closest_2nd_dis in Sys.glob(file.path("./closest.2nd*GAT.to.1st*GAT.GATA3.motif5.bed"))) {
print(closest_2nd_dis)
anchor_status =strsplit((strsplit(strsplit(closest_2nd_dis, "/")[[1]][length(strsplit(closest_2nd_dis, "/")[[1]])], 'closest.2nd.*.to.1st.')[[1]][2]), ".GATA3.motif5.bed")[[1]][1]
print(anchor_status)
query_status =strsplit((strsplit(strsplit(closest_2nd_dis, "/")[[1]][length(strsplit(closest_2nd_dis, "/")[[1]])], 'closest.2nd.')[[1]][2]), ".to.1st.*.GATA3.motif5.bed")[[1]][1]
print(query_status)
temp = cbind(read.table(closest_2nd_dis, header=F, comment.char='')[,11], anchor_status, query_status)
df.plot = rbind(df.plot,temp)
}
str(df.plot)
unique(df.plot$anchor_status)
#[1] "minus.GAT" "plus.GAT"
unique(df.plot$query_status)
#[1] "minus.GAT" "plus.GAT"
colnames(df.plot) = c("dis", "anchor_status", "query_status")
head(df.plot)
# dis anchor_status query_status
#1 45 minus.GAT minus.GAT
#2 7 minus.GAT minus.GAT
#3 47 minus.GAT minus.GAT
#4 5 minus.GAT minus.GAT
#5 41 minus.GAT minus.GAT
#6 10 minus.GAT minus.GAT
df.plot$anchor_status = factor(df.plot$anchor_status, levels = c("plus.GAT", "minus.GAT"))
df.plot$query_status = factor(df.plot$query_status, levels = c("plus.GAT", "minus.GAT"))
df.plot$dis=as.numeric(df.plot$dis)
library(lattice)
library(latticeExtra)
# PDF plot
pdf('PDF_closest_2nd_GAT_to_closest_1st_GAT_to_GATA3_pos_motif5.pdf', width=15,height=5)
print(densityplot( ~ abs(dis) | anchor_status,
groups = query_status,
data = df.plot,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c("red","blue"),
aspect = 1,
xlim=c(0,50),
from=0,
to=50,
#layout=c(1,2),
xlab = "distance (bp) from 2nd closest GAT to closest GAT",
ylab="probability density function",
main="GATA3 peak with motif5",
between=list(y=1.0),
scales = list(x = list(rot = 45, at = seq(from = 0, to = 50, by = 5))),
type = "count",
lty = c(1),
lwd=2,
par.settings = list(superpose.line = list(col=c("red", "blue"), lwd=3), strip.background=list(col="grey85")),
panel = function(x, ...) {
panel.densityplot(x, ...)
panel.ablineq(v = 2, col = "red", lty = 2, adj=c(0,1), label="2")
panel.ablineq(v = 5, col = "red", lty = 2, adj=c(0,1), label="5")
panel.ablineq(v = 9, col = "red", lty = 2, adj=c(0,1), label="9")}
))
dev.off()
png('PDF_closest_2nd_GAT_to_closest_1st_GAT_to_GATA3_pos_motif5.png')
print(densityplot( ~ abs(dis) | anchor_status,
groups = query_status,
data = df.plot,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c("red","blue"),
aspect = 1,
xlim=c(0,50),
from=0,
to=50,
#layout=c(1,2),
xlab = "distance (bp) from 2nd closest GAT to closest GAT",
ylab="probability density function",
main="GATA3 peak with motif5",
between=list(y=1.0),
scales = list(x = list(rot = 45, at = seq(from = 0, to = 50, by = 5))),
type = "count",
lty = c(1),
lwd=2,
par.settings = list(superpose.line = list(col=c("red", "blue"), lwd=3), strip.background=list(col="grey85")),
panel = function(x, ...) {
panel.densityplot(x, ...)
panel.ablineq(v = 2, col = "red", lty = 2, adj=c(0,1), label="2")
panel.ablineq(v = 5, col = "red", lty = 2, adj=c(0,1), label="5")
panel.ablineq(v = 9, col = "red", lty = 2, adj=c(0,1), label="9")}
))
dev.off()
alternative PDFs with data points
my_motifs = c("motif4","motif2", "motif6", "motif5")
for (motif in my_motifs) {
print(motif)
df.plot = data.frame(matrix(nrow = 0, ncol = 5))
colnames(df.plot) = c("dis","anchor_status", "query_status","abs.dis", "density")
for (closest_2nd_dis in Sys.glob(file.path(paste0("./closest.2nd*GAT.to.1st*GAT.GATA3.", motif, ".bed")))) {
print(closest_2nd_dis)
anchor_status =strsplit((strsplit(strsplit(closest_2nd_dis, "/")[[1]][length(strsplit(closest_2nd_dis, "/")[[1]])], 'closest.2nd.*.to.1st.')[[1]][2]), paste0(".GATA3.", motif, ".bed"))[[1]][1]
print(anchor_status)
query_status =strsplit((strsplit(strsplit(closest_2nd_dis, "/")[[1]][length(strsplit(closest_2nd_dis, "/")[[1]])], 'closest.2nd.')[[1]][2]), paste0(".to.1st.*.GATA3.", motif, ".bed"))[[1]][1]
print(query_status)
temp = as.data.frame(cbind(read.table(closest_2nd_dis, header=F, comment.char='')[,11], anchor_status, query_status))
colnames(temp) = c("dis", "anchor_status", "query_status")
temp$dis=as.numeric(temp$dis)
temp$abs.dis=abs(temp$dis)
density_values = density(temp$abs.dis, from=0, to=50, bw="nrd0")
temp$density =approx(density_values$x, density_values$y, temp$abs.dis)$y
df.plot = rbind(df.plot,temp)
}
str(df.plot)
unique(df.plot$anchor_status)
#[1] "minus.GAT" "plus.GAT"
unique(df.plot$query_status)
#[1] "minus.GAT" "plus.GAT"
head(df.plot)
df.plot$anchor_status = factor(df.plot$anchor_status, levels = c("plus.GAT", "minus.GAT"))
df.plot$query_status = factor(df.plot$query_status, levels = c("plus.GAT", "minus.GAT"))
library(lattice)
library(latticeExtra)
pdf(paste0('PDF2_closest_2nd_GAT_to_closest_1st_GAT_to_GATA3_pos_', motif, '.pdf'), width=15,height=5)
print(xyplot(density ~ abs.dis | anchor_status,
data = df.plot,
groups = query_status,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
aspect = 1,
xlim=c(0,50),
#type = c('p', 'smooth'),
xlab = "distance (bp) from 2nd closest GAT to closest GAT",
ylab="probability density function",
main=paste0("GATA3 peak with ", motif),
between=list(y=1.0),
scales = list(x = list(rot = 45, at = seq(from = 0, to = 50, by = 5))),
par.settings = list(superpose.line = list(col=c("pink", "skyblue"), lwd=2), strip.background=list(col="grey85")),
panel = function(x, y, ...) {panel.densityplot(x,
data = df.plot,
from=0,
to=50,
bw="nrd0",
lty = c(1),
lwd=2,
type = "count",
col=c("pink","skyblue"), ...)
panel.xyplot(x, y,
col=c("red","blue"),
pch=18,
cex=0.6,...)
panel.abline(v=c(seq(1, 10, 1)), col = "grey90")
}))
dev.off()
}
calculate_relative_frequency <- function(data) {
# Use table() to create a frequency table
relative_frequencies <- table(data)/length(data)
result <- data.frame(value = names(relative_frequencies), rel_freq = as.vector(relative_frequencies))
return(result)
}
my_motifs = c("motif4","motif2", "motif6", "motif5")
for (motif in my_motifs) {
print(motif)
df.plot = data.frame(matrix(nrow = 0, ncol = 5))
colnames(df.plot) = c("dis","anchor_status", "query_status","abs.dis", "rel_freq")
for (closest_2nd_dis in Sys.glob(file.path(paste0("./closest.2nd*GAT.to.1st*GAT.GATA3.", motif, ".bed")))) {
print(closest_2nd_dis)
anchor_status =strsplit((strsplit(strsplit(closest_2nd_dis, "/")[[1]][length(strsplit(closest_2nd_dis, "/")[[1]])], 'closest.2nd.*.to.1st.')[[1]][2]), paste0(".GATA3.", motif, ".bed"))[[1]][1]
print(anchor_status)
query_status =strsplit((strsplit(strsplit(closest_2nd_dis, "/")[[1]][length(strsplit(closest_2nd_dis, "/")[[1]])], 'closest.2nd.')[[1]][2]), paste0(".to.1st.*.GATA3.", motif, ".bed"))[[1]][1]
print(query_status)
temp = as.data.frame(cbind(read.table(closest_2nd_dis, header=F, comment.char='')[,11], anchor_status, query_status))
colnames(temp) = c("dis", "anchor_status", "query_status")
temp$dis=as.integer(temp$dis)
temp$abs.dis=abs(temp$dis)
relative_frequency = calculate_relative_frequency(temp$abs.dis)
temp1= merge(temp, relative_frequency, by.x = "abs.dis", by.y = "value", all.x = TRUE)
df.plot = rbind(df.plot,temp1)
}
str(df.plot)
unique(df.plot$anchor_status)
#[1] "minus.GAT" "plus.GAT"
unique(df.plot$query_status)
#[1] "minus.GAT" "plus.GAT"
head(df.plot)
df.plot$anchor_status = factor(df.plot$anchor_status, levels = c("plus.GAT", "minus.GAT"))
df.plot$query_status = factor(df.plot$query_status, levels = c("plus.GAT", "minus.GAT"))
nrow(df.plot)
#[1] 49842
nrow(df.plot[!duplicated(df.plot), ])
summary(df.plot[!duplicated(df.plot), ])
library(lattice)
library(latticeExtra)
pdf(paste0('xy_closest_2nd_GAT_to_closest_1st_GAT_to_GATA3_pos_', motif, '.pdf'), width=15,height=5)
print(xyplot(rel_freq ~ abs.dis | anchor_status,
#data = df.plot[!duplicated(df.plot), ],
data = df.plot,
groups = query_status,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
aspect = 1,
xlim=c(0,50),
ylim=c(0, 0.3),
#type = c('p', 'smooth'),
xlab = "distance (bp) from 2nd closest GAT to closest GAT",
ylab="Frequency",
main=paste0("GATA3 peak with ", motif),
between=list(y=1.0),
scales = list(x = list(rot = 45, at = seq(from = 0, to = 50, by = 5))),
par.settings = list(superpose.line = list(col=c("pink", "skyblue"), lwd=2), strip.background=list(col="grey85")),
panel = function(x,y,...) {panel.densityplot(x, data = df.plot,
from=0,
to=50,
lty = c(1),
lwd=2,
darg=list(bw = "nrd0", kernel="gaussian"),
type = "count",
col=c("pink","skyblue"), ...)
panel.xyplot(x, y,
col=c("red","blue"),
pch=18,
cex=0.6,...)
panel.abline(v=c(seq(1, 10, 1)), col = "grey90")
})
)
dev.off()
png(paste0('xy_closest_2nd_GAT_to_closest_1st_GAT_to_GATA3_pos_', motif, '.png'))
print(xyplot(rel_freq ~ abs.dis | anchor_status,
#data = df.plot[!duplicated(df.plot), ],
data = df.plot,
groups = query_status,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
aspect = 1,
xlim=c(0,50),
ylim=c(0, 0.3),
#type = c('p', 'smooth'),
xlab = "distance (bp) from 2nd closest GAT to closest GAT",
ylab="Frequency",
main=paste0("GATA3 peak with ", motif),
between=list(y=1.0),
scales = list(x = list(rot = 45, at = seq(from = 0, to = 50, by = 5))),
par.settings = list(superpose.line = list(col=c("pink", "skyblue"), lwd=2), strip.background=list(col="grey85")),
panel = function(x,y,...) {panel.densityplot(x, data = df.plot,
from=0,
to=50,
lty = c(1),
lwd=2,
darg=list(bw = "nrd0", kernel="gaussian"),
type = "count",
col=c("pink","skyblue"), ...)
panel.xyplot(x, y,
col=c("red","blue"),
pch=18,
cex=0.6,...)
panel.abline(v=c(seq(1, 10, 1)), col = "grey90")
})
)
dev.off()
}
Figure 17: xy plot of GATA3 peak with motif4: 3mer structure
Figure 18: xy plot of GATA3 peak with motif2: 3mer structure
Figure 19: xy plot of GATA3 peak with motif6: 3mer structure
Figure 20: xy plot of GATA3 peak with motif5: 3mer structure
closest.1st.plus.GAT.to.indep.DHS.control.consensus.bed
closest.1st.minus.GAT.to.indep.DHS.control.consensus.bed
Remove the first closest GAT with bedtools subtract
Subtract the 1st closest GAT from all.GAT, then find the closest 2nd GAT to the closest 1st GAT.
module load bedtools
#dir=/labs/Guertin/siyu/Sathyan_GATA3_ChIP_pool1_pool2/overrep_3mer/hg38_full_kmer3_rs1000/seqdump/
#sort -k1,1 -k2,2n ${dir}hg38.3.3.3plus.36_GAT.bed > hg38.3.3.3plus.36_GAT.sorted.bed
awk '{print $4, $5, $6, $7, $8, $9, $10}' closest.1st.plus.GAT.to.indep.DHS.control.consensus.bed | awk '{$1=$1}1' OFS="\t" | uniq | sort -k1,1 -k2,2n > closest.1st.plus.GAT.to.indep.DHS.control.consensus.uniq.sorted.bed
bedtools subtract -a hg38.3.3.3plus.36_GAT.sorted.bed -b closest.1st.plus.GAT.to.indep.DHS.control.consensus.uniq.sorted.bed -f 1.00 -s > hg38.3.3.3plus.36_GAT_without_1st_plus_GAT_to_indep_DHS_control.bed
wc -l hg38.3.3.3plus.36_GAT.sorted.bed #39090736
wc -l closest.1st.plus.GAT.to.indep.DHS.control.consensus.uniq.sorted.bed #57261
wc -l hg38.3.3.3plus.36_GAT_without_1st_plus_GAT_to_indep_DHS_control.bed #39033475
head -5 hg38.3.3.3plus.36_GAT_without_1st_plus_GAT_to_indep_DHS_control.bed
#chr1 11159 11162 36 36 + GAT
#chr1 11462 11465 36 36 + GAT
#chr1 11521 11524 36 36 + GAT
#chr1 11569 11572 36 36 + GAT
#chr1 11651 11654 36 36 + GAT
#sort -k1,1 -k2,2n ${dir}hg38.3.3.3minus.14_GAT.bed > hg38.3.3.3minus.14_GAT.sorted.bed
awk '{print $4, $5, $6, $7, $8, $9, $10}' closest.1st.minus.GAT.to.indep.DHS.control.consensus.bed | awk '{$1=$1}1' OFS="\t" | uniq | sort -k1,1 -k2,2n > closest.1st.minus.GAT.to.indep.DHS.control.consensus.uniq.sorted.bed
bedtools subtract -a hg38.3.3.3minus.14_GAT.sorted.bed -b closest.1st.minus.GAT.to.indep.DHS.control.consensus.uniq.sorted.bed -f 1.00 -s > hg38.3.3.3minus.14_GAT_without_1st_minus_GAT_to_indep_DHS_control.bed
wc -l hg38.3.3.3minus.14_GAT.sorted.bed #38717592
wc -l closest.1st.minus.GAT.to.indep.DHS.control.consensus.uniq.sorted.bed #57251
wc -l hg38.3.3.3minus.14_GAT_without_1st_minus_GAT_to_indep_DHS_control.bed #38660341
head -5 hg38.3.3.3minus.14_GAT_without_1st_minus_GAT_to_indep_DHS_control.bed
#chr1 10545 10548 14 14 - GAT
#chr1 11145 11148 14 14 - GAT
#chr1 11160 11163 14 14 - GAT
#chr1 11576 11579 14 14 - GAT
#chr1 11597 11600 14 14 - GAT
Find closest 2nd GAT to 1st GAT
find the + or - 2nd GAT to the closest +GAT
find the + or - 2nd GAT to the closest -GAT
#!/usr/bin/env Rscript
Args=commandArgs(TRUE)
bedTools.closest <- function(functionstring="/home/FCAM/ssun/packages/bedtools2/bin/closestBed",bed1,bed2,opt.string="") {
options(scipen =99) # not use scientific notation when writing out
#write bed formatted data.frames to tempfile
write.table(bed1,file= 'a.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(bed2,file= 'b.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
# create the command string and call the command using system()
# the command sort a and b file by coordinates
command1=paste('sort -k1,1 -k2,2n', 'a.file.bed', '> a.file.sorted.bed')
cat(command1,"\n") #sort -k1,1 -k2,2n a.file.bed > a.file.sorted.bed
try(system(command1))
command2=paste('sort -k1,1 -k2,2n', 'b.file.bed', '> b.file.sorted.bed')
cat(command2,"\n")
try(system(command2))
# the command call closestBed on bed1 and bed2
command=paste(functionstring, opt.string,"-a",'a.file.sorted.bed',"-b",'b.file.sorted.bed',">",'out.file.bed',sep=" ")
cat(command,"\n")
try(system(command))
res=read.table('out.file.bed',header=F, comment.char='')
# remove intermediate files
command3=paste('rm', 'a.file.bed', 'b.file.bed', 'a.file.sorted.bed', 'b.file.sorted.bed', 'out.file.bed')
cat(command3,"\n")
try(system(command3))
colnames(res) = c(colnames(bed1), colnames(bed2), 'dis' )
return(res)
}
library(bigWig)
# GAT coordinates without overlapped closest GAT -- relative to G for plus, relative to C for minus
plus.GAT.file=fiveprime.bed(read.table(file = "hg38.3.3.3plus.36_GAT_without_1st_plus_GAT_to_indep_DHS_control.bed", sep="\t", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
minus.GAT.file=fiveprime.bed(read.table(file = "hg38.3.3.3minus.14_GAT_without_1st_minus_GAT_to_indep_DHS_control.bed", sep="\t", header=FALSE), upstreamWindow = -1, downstreamWindow = 1)
head(plus.GAT.file) # find coordinates at Genome Browser to make sure the anchor point is correct -- anchor at G on GAT
nrow(plus.GAT.file)
head(minus.GAT.file) # find coordinates at Genome Browser to make sure the anchor point is correct -- anchor at C on ATC
nrow(minus.GAT.file)
# anchor position: closest +/- GAT relative to G
closest_plus_GAT_to_indep_DHS=fiveprime.bed(read.table("closest.1st.plus.GAT.to.indep.DHS.control.consensus.uniq.sorted.bed", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
closest_minus_GAT_to_indep_DHS=fiveprime.bed(read.table("closest.1st.minus.GAT.to.indep.DHS.control.consensus.uniq.sorted.bed", header=FALSE), upstreamWindow = -1, downstreamWindow = 1)
nrow(closest_plus_GAT_to_indep_DHS)
head(closest_plus_GAT_to_indep_DHS) # find coordinates at Genome Browser to make sure the anchor point is correct -- anchor at G on GAT
nrow(closest_minus_GAT_to_indep_DHS)
head(closest_minus_GAT_to_indep_DHS) # find coordinates at Genome Browser to make sure the anchor point is correct -- anchor at C on ATC
# 2nd closest +/-GAT to closest plus GAT
closest.2nd.plus.GAT.to.1st.plus.GAT.indep.DHS=bedTools.closest(bed1 = closest_plus_GAT_to_indep_DHS[,1:3], bed2 = plus.GAT.file, opt.string = '-d -t first')
closest.2nd.minus.GAT.to.1st.plus.GAT.indep.DHS=bedTools.closest(bed1 = closest_minus_GAT_to_indep_DHS[,1:3], bed2 = plus.GAT.file, opt.string = '-d -t first')
write.table(closest.2nd.plus.GAT.to.1st.plus.GAT.indep.DHS,file= 'closest.2nd.plus.GAT.to.1st.plus.GAT.indep.DHS.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.2nd.minus.GAT.to.1st.plus.GAT.indep.DHS,file= 'closest.2nd.minus.GAT.to.1st.plus.GAT.indep.DHS.bed', quote=F,sep="\t",col.names=F,row.names=F)
# 2nd closest +/- GAT to closest minus GAT
closest.2nd.plus.GAT.to.1st.minus.GAT.indep.DHS=bedTools.closest(bed1 = closest_plus_GAT_to_indep_DHS[,1:3], bed2 = minus.GAT.file, opt.string = '-d -t first')
closest.2nd.minus.GAT.to.1st.minus.GAT.indep.DHS=bedTools.closest(bed1 = closest_minus_GAT_to_indep_DHS[,1:3], bed2 = minus.GAT.file, opt.string = '-d -t first')
write.table(closest.2nd.plus.GAT.to.1st.minus.GAT.indep.DHS,file= 'closest.2nd.plus.GAT.to.1st.minus.GAT.indep.DHS.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.2nd.minus.GAT.to.1st.minus.GAT.indep.DHS,file= 'closest.2nd.minus.GAT.to.1st.minus.GAT.indep.DHS.bed', quote=F,sep="\t",col.names=F,row.names=F)
save.image('240129_GATA3_ChIP_2nd_clsestbed3.Rdata')
df.plot = data.frame(matrix(nrow = 0, ncol = 3))
colnames(df.plot) = c("dis", "anchor_status", "query_status")
for (closest_2nd_dis in Sys.glob(file.path("./closest.2nd*GAT.to.1st*GAT.indep.DHS.bed"))) {
print(closest_2nd_dis)
anchor_status =strsplit((strsplit(strsplit(closest_2nd_dis, "/")[[1]][length(strsplit(closest_2nd_dis, "/")[[1]])], 'closest.2nd.*.to.1st.')[[1]][2]), ".indep.DHS.bed")[[1]][1]
print(anchor_status)
query_status =strsplit((strsplit(strsplit(closest_2nd_dis, "/")[[1]][length(strsplit(closest_2nd_dis, "/")[[1]])], 'closest.2nd.')[[1]][2]), ".to.1st.*.indep.DHS.bed")[[1]][1]
print(query_status)
temp = cbind(read.table(closest_2nd_dis, header=F, comment.char='')[,11], anchor_status, query_status)
df.plot = rbind(df.plot,temp)
}
str(df.plot)
unique(df.plot$anchor_status)
#[1] "minus.GAT" "plus.GAT"
unique(df.plot$query_status)
#[1] "minus.GAT" "plus.GAT"
colnames(df.plot) = c("dis", "anchor_status", "query_status")
head(df.plot)
#dis anchor_status query_status
#1 17 minus.GAT minus.GAT
#2 23 minus.GAT minus.GAT
#3 62 minus.GAT minus.GAT
#4 17 minus.GAT minus.GAT
#5 83 minus.GAT minus.GAT
#6 25 minus.GAT minus.GAT
nrow(df.plot)
#[1] 229024
df.plot$anchor_status = factor(df.plot$anchor_status, levels = c("plus.GAT", "minus.GAT"))
df.plot$query_status = factor(df.plot$query_status, levels = c("plus.GAT", "minus.GAT"))
df.plot$dis=as.numeric(df.plot$dis)
library(lattice)
library(latticeExtra)
# PDF plot
pdf('PDF_closest_2nd_GAT_to_closest_1st_GAT_to_indep_DHS.pdf', width=15,height=5)
print(densityplot( ~ abs(dis) | anchor_status,
groups = query_status,
data = df.plot,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c("pink","skyblue"),
aspect = 1,
xlim=c(0,50),
ylim=c(0, 0.03),
from=0,
to=50,
#layout=c(1,2),
xlab = "distance (bp) from 2nd closest GAT to closest GAT",
ylab="probability density function",
main="independent DHS regions",
between=list(y=1.0),
scales = list(x = list(rot = 45, at = seq(from = 0, to = 50, by = 5))),
type = "count",
lty = c(1),
lwd=2,
par.settings = list(superpose.line = list(col=c("pink", "skyblue"), lwd=3), strip.background=list(col="grey85"))
))
dev.off()
png('PDF_closest_2nd_GAT_to_closest_1st_GAT_to_indep_DHS.png')
print(densityplot( ~ abs(dis) | anchor_status,
groups = query_status,
data = df.plot,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c("pink","skyblue"),
aspect = 1,
xlim=c(0,50),
ylim=c(0, 0.03),
from=0,
to=50,
#layout=c(1,2),
xlab = "distance (bp) from 2nd closest GAT to closest GAT",
ylab="probability density function",
main="independent DHS regions",
between=list(y=1.0),
scales = list(x = list(rot = 45, at = seq(from = 0, to = 50, by = 5))),
type = "count",
lty = c(1),
lwd=2,
par.settings = list(superpose.line = list(col=c("pink", "skyblue"), lwd=3), strip.background=list(col="grey85"))
))
dev.off()
Figure 21: independent DHS regions: 3mer structure
Add actual data point to the density plot:
df.plot = data.frame(matrix(nrow = 0, ncol = 5))
colnames(df.plot) = c("dis","anchor_status", "query_status","abs.dis", "density")
for (closest_2nd_dis in Sys.glob(file.path("./closest.2nd*GAT.to.1st*GAT.indep.DHS.bed"))) {
print(closest_2nd_dis)
anchor_status =strsplit((strsplit(strsplit(closest_2nd_dis, "/")[[1]][length(strsplit(closest_2nd_dis, "/")[[1]])], 'closest.2nd.*.to.1st.')[[1]][2]), ".indep.DHS.bed")[[1]][1]
print(anchor_status)
query_status =strsplit((strsplit(strsplit(closest_2nd_dis, "/")[[1]][length(strsplit(closest_2nd_dis, "/")[[1]])], 'closest.2nd.')[[1]][2]), ".to.1st.*.indep.DHS.bed")[[1]][1]
print(query_status)
temp = as.data.frame(cbind(read.table(closest_2nd_dis, header=F, comment.char='')[,11], anchor_status, query_status))
colnames(temp) = c("dis", "anchor_status", "query_status")
temp$dis=as.numeric(temp$dis)
temp$abs.dis=abs(temp$dis)
density_values = density(temp$abs.dis, from=0, to=50, bw="nrd0")
temp$density =approx(density_values$x, density_values$y, temp$abs.dis)$y
df.plot = rbind(df.plot,temp)
}
str(df.plot)
unique(df.plot$anchor_status)
#[1] "minus.GAT" "plus.GAT"
unique(df.plot$query_status)
#[1] "minus.GAT" "plus.GAT"
head(df.plot)
#dis anchor_status query_status abs.dis density
#1 17 minus.GAT minus.GAT 17 0.009672190
#2 23 minus.GAT minus.GAT 23 0.009327138
#3 62 minus.GAT minus.GAT 62 NA
#4 17 minus.GAT minus.GAT 17 0.009672190
#5 83 minus.GAT minus.GAT 83 NA
#6 25 minus.GAT minus.GAT 25 0.009207771
nrow(df.plot)
#[1] 229024
df.plot$anchor_status = factor(df.plot$anchor_status, levels = c("plus.GAT", "minus.GAT"))
df.plot$query_status = factor(df.plot$query_status, levels = c("plus.GAT", "minus.GAT"))
library(lattice)
library(latticeExtra)
pdf('PDF2_closest_2nd_GAT_to_closest_1st_GAT_to_indep_DHS.pdf', width=15,height=5)
print(xyplot(density ~ abs.dis | anchor_status,
data = df.plot,
groups = query_status,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
aspect = 1,
xlim=c(0,50),
ylim=c(0, 0.03),
#type = c('p', 'smooth'),
xlab = "distance (bp) from 2nd closest GAT to closest GAT",
ylab="probability density function",
main="Indendent DHS Regions",
between=list(y=1.0),
scales = list(x = list(rot = 45, at = seq(from = 0, to = 50, by = 5))),
par.settings = list(superpose.line = list(col=c("pink", "skyblue"), lwd=2), strip.background=list(col="grey85")),
panel = function(x, y, ...) {panel.densityplot(x,
from=0,
to=50,
lty = c(1),
lwd=2,
type = "count",
bw="nrd0",
col=c("pink","skyblue"), ...)
panel.xyplot(x, y,
col=c("red","blue"),
pch=18,
cex=0.6,...)
panel.abline(v=c(seq(1, 10, 1)), col = "grey90")
}))
dev.off()
calculate_relative_frequency <- function(data) {
# Use table() to create a frequency table
relative_frequencies <- table(data)/length(data)
result <- data.frame(value = names(relative_frequencies), rel_freq = as.vector(relative_frequencies))
return(result)
}
df.plot = data.frame(matrix(nrow = 0, ncol = 5))
colnames(df.plot) = c("dis","anchor_status", "query_status","abs.dis", "rel_freq")
for (closest_2nd_dis in Sys.glob(file.path("./closest.2nd*GAT.to.1st*GAT.indep.DHS.bed"))) {
print(closest_2nd_dis)
anchor_status =strsplit((strsplit(strsplit(closest_2nd_dis, "/")[[1]][length(strsplit(closest_2nd_dis, "/")[[1]])], 'closest.2nd.*.to.1st.')[[1]][2]), ".indep.DHS.bed")[[1]][1]
print(anchor_status)
query_status =strsplit((strsplit(strsplit(closest_2nd_dis, "/")[[1]][length(strsplit(closest_2nd_dis, "/")[[1]])], 'closest.2nd.')[[1]][2]), ".to.1st.*.indep.DHS.bed")[[1]][1]
print(query_status)
temp = as.data.frame(cbind(read.table(closest_2nd_dis, header=F, comment.char='')[,11], anchor_status, query_status))
colnames(temp) = c("dis", "anchor_status", "query_status")
temp$dis=as.integer(temp$dis)
temp$abs.dis=abs(temp$dis)
relative_frequency = calculate_relative_frequency(temp$abs.dis)
temp1= merge(temp, relative_frequency, by.x = "abs.dis", by.y = "value", all.x = TRUE)
df.plot = rbind(df.plot,temp1)
}
str(df.plot)
unique(df.plot$anchor_status)
#[1] "minus.GAT" "plus.GAT"
unique(df.plot$query_status)
#[1] "minus.GAT" "plus.GAT"
head(df.plot)
# abs.dis dis anchor_status query_status rel_freq
#1 3 3 minus.GAT minus.GAT 0.01620932
#2 3 3 minus.GAT minus.GAT 0.01620932
#3 3 3 minus.GAT minus.GAT 0.01620932
#4 3 3 minus.GAT minus.GAT 0.01620932
#5 3 3 minus.GAT minus.GAT 0.01620932
#6 3 3 minus.GAT minus.GAT 0.01620932
df.plot$anchor_status = factor(df.plot$anchor_status, levels = c("plus.GAT", "minus.GAT"))
df.plot$query_status = factor(df.plot$query_status, levels = c("plus.GAT", "minus.GAT"))
nrow(df.plot)
#[1] 229024
nrow(df.plot[!duplicated(df.plot), ])
#2859
summary(df.plot[!duplicated(df.plot), ])
abs.dis dis anchor_status query_status
Min. : 1.0 Min. : 1.0 plus.GAT :1401 plus.GAT :1433
1st Qu.: 180.0 1st Qu.: 180.0 minus.GAT:1458 minus.GAT:1426
Median : 359.0 Median : 359.0
Mean : 498.4 Mean : 498.4
3rd Qu.: 569.5 3rd Qu.: 569.5
Max. :42639.0 Max. :42639.0
rel_freq
Min. :1.746e-05
1st Qu.:3.493e-05
Median :1.223e-04
Mean :1.399e-03
3rd Qu.:1.484e-03
Max. :1.039e-01
library(lattice)
library(latticeExtra)
pdf('xy_closest_2nd_GAT_to_closest_1st_GAT_to_indep_DHS.pdf', width=15,height=5)
print(xyplot(rel_freq ~ abs.dis | anchor_status,
#data = df.plot[!duplicated(df.plot), ],
data = df.plot,
groups = query_status,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
aspect = 1,
xlim=c(0,50),
ylim=c(0, 0.3),
#type = c('p', 'smooth'),
xlab = "distance (bp) from 2nd closest GAT to closest GAT",
ylab="Frequency",
main="Independent DHS Regions",
between=list(y=1.0),
scales = list(x = list(rot = 45, at = seq(from = 0, to = 50, by = 5))),
par.settings = list(superpose.line = list(col=c("pink", "skyblue"), lwd=2), strip.background=list(col="grey85")),
panel = function(x,y,...) {panel.densityplot(x, data = df.plot,
from=0,
to=50,
lty = c(1),
lwd=2,
darg=list(bw = "nrd0", kernel="gaussian"),
type = "count",
col=c("pink","skyblue"), ...)
panel.xyplot(x, y,
col=c("red","blue"),
pch=18,
cex=0.6,...)
panel.abline(v=c(seq(1, 10, 1)), col = "grey90")
})
)
dev.off()
png('xy_closest_2nd_GAT_to_closest_1st_GAT_to_indep_DHS.png')
print(xyplot(rel_freq ~ abs.dis | anchor_status,
#data = df.plot[!duplicated(df.plot), ],
data = df.plot,
groups = query_status,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
aspect = 1,
xlim=c(0,50),
ylim=c(0, 0.3),
#type = c('p', 'smooth'),
xlab = "distance (bp) from 2nd closest GAT to closest GAT",
ylab="Frequency",
main="Independent DHS Regions",
between=list(y=1.0),
scales = list(x = list(rot = 45, at = seq(from = 0, to = 50, by = 5))),
par.settings = list(superpose.line = list(col=c("pink", "skyblue"), lwd=2), strip.background=list(col="grey85")),
panel = function(x,y,...) {panel.densityplot(x, data = df.plot,
from=0,
to=50,
lty = c(1),
lwd=2,
darg=list(bw = "nrd0", kernel="gaussian"),
type = "count",
col=c("pink","skyblue"), ...)
panel.xyplot(x, y,
col=c("red","blue"),
pch=18,
cex=0.6,...)
panel.abline(v=c(seq(1, 10, 1)), col = "grey90")
})
)
dev.off()
Figure 22: xy plot of Independent DHS Regions: 3mer structure
files to upload:
-peak summit BED file.
-3mer-GAT coordinates file (+/-) (in .bigWig format);
-several annotation bed files including: 1) peak summit files; 2) closest GAT to peak summit (+/-); 3) 2nd closest GAT to peak summit (+/-).
GAT coordinates files: convert to bigWig
sort the .bed file:
#GAT
dir=/labs/Guertin/siyu/Sathyan_GATA3_ChIP_pool1_pool2/overrep_3mer/hg38_full_kmer3_rs1000/seqdump/
plus_file=${dir}hg38.3.3.3plus.36_GAT.bed
minus_file=${dir}hg38.3.3.3minus.14_GAT.bed
echo "plus_GAT (.bed)"
wc -l ${plus_file}
head -5 ${plus_file}
echo "minus_GAT (.bed)"
wc -l ${minus_file}
head -5 ${minus_file}
sort -k1,1 -k2,2n ${plus_file} > hg38.3.3.3.1000.plus.36_GAT_sorted.bed
sort -k1,1 -k2,2n ${minus_file} > hg38.3.3.3.1000.minus.14_GAT_sorted.bed
wc -l hg38.3.3.3.1000.plus.36_GAT_sorted.bed
wc -l hg38.3.3.3.1000.minus.14_GAT_sorted.bed
Use bedtools genomecov -bg to convert .bed to .bedGraph.
module load bedtools
module load ucsc_genome/2012.05.22
sizes=/home/FCAM/ssun/Genome_pro/hg38.chrom.sizes
#bedtools genomecov tool
bedtools genomecov -bg -i hg38.3.3.3.1000.plus.36_GAT_sorted.bed -g ${sizes} > hg38.3.3.3.1000.plus.36_GAT.bedGraph
bedtools genomecov -bg -i hg38.3.3.3.1000.minus.14_GAT_sorted.bed -g ${sizes} > hg38.3.3.3.1000.minus.14_GAT.bedGraph
wc -l hg38.3.3.3.1000.plus.36_GAT.bedGraph
wc -l hg38.3.3.3.1000.minus.14_GAT.bedGraph
The final step is to convert .bedGraph to .bigWig with UCSC bedGraphToBigWig tool.
#UCSC bedGraphToBigWig tool
bedGraphToBigWig hg38.3.3.3.1000.plus.36_GAT.bedGraph ${sizes} hg38.3.3.3.1000.plus.36_GAT.bigWig
bedGraphToBigWig hg38.3.3.3.1000.minus.14_GAT.bedGraph ${sizes} hg38.3.3.3.1000.minus.14_GAT.bigWig
#cat UCSCGB.sh_7661175.err
cat UCSCGB.sh_7661175.out
## plus_GAT (.bed)
## 39090736 /labs/Guertin/siyu/Sathyan_GATA3_ChIP_pool1_pool2/overrep_3mer/hg38_full_kmer3_rs1000/seqdump/hg38.3.3.3plus.36_GAT.bed
## chr1 11159 11162 36 36 + GAT
## chr1 11462 11465 36 36 + GAT
## chr1 11521 11524 36 36 + GAT
## chr1 11569 11572 36 36 + GAT
## chr1 11651 11654 36 36 + GAT
## minus_GAT (.bed)
## 38717592 /labs/Guertin/siyu/Sathyan_GATA3_ChIP_pool1_pool2/overrep_3mer/hg38_full_kmer3_rs1000/seqdump/hg38.3.3.3minus.14_GAT.bed
## chr1 10545 10548 14 14 - GAT
## chr1 11145 11148 14 14 - GAT
## chr1 11160 11163 14 14 - GAT
## chr1 11576 11579 14 14 - GAT
## chr1 11597 11600 14 14 - GAT
## 39090736 hg38.3.3.3.1000.plus.36_GAT_sorted.bed
## 38717592 hg38.3.3.3.1000.minus.14_GAT_sorted.bed
## 38312401 hg38.3.3.3.1000.plus.36_GAT.bedGraph
## 37963744 hg38.3.3.3.1000.minus.14_GAT.bedGraph
Note that the .bedGraph converted from .bed with bedtools genomecov has less genome coordinates entry. This is because the original .bed file contains adjacent regions, and bedtools genomecov collapse these regions into a single entry in the BEDGraph file by merging or summarizing the coverage data.
Trackhub link: http://guertinlab.cam.uchc.edu/over3mer_hub/hub.txt
The annotation BED files need to be upload via Custom Track.
GATA3 peak with motif4
library(bigWig)
peak.region.summit=center.bed(read.table('/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/GAT_3mer_analaysis/MAST_positive_control/GATA3_peak_161win_with_motif_4.bed', sep = "\t", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
write.table(peak.region.summit,file= 'GATA3_peak_summit_with_motif_4.bed', quote=F,sep="\t",col.names=F,row.names=F)
head -5 GATA3_peak_summit_with_motif_4.bed
#chr10 101454261 101454262 GATA_ChIP_peak_10760c
#chr10 101703137 101703138 GATA_ChIP_peak_10768a
#chr10 101716956 101716957 GATA_ChIP_peak_10770
#chr10 102509735 102509736 GATA_ChIP_peak_10791b
#chr10 1025947 1025948 GATA_ChIP_peak_7503
# 1) peak summit files
cat GATA3_peak_summit_with_motif_4.bed | sort -k1,1 -k2,2n > GATA3_peak_summit_with_motif_4.sorted.bed
# 2) closest GAT to peak summit (+/-)
awk '{print $4, $5, $6, $7, $8, $9}' closest.1st.plus.GAT.to.GATA3.with.motif4.bed | awk '{$1=$1}1' OFS="\t" | sort -k1,1 -k2,2n > closest.1st.plus.GAT.to.GATA3.with.motif4.sorted.bed
awk '{print $4, $5, $6, $7, $8, $9}' closest.1st.minus.GAT.to.GATA3.with.motif4.bed | awk '{$1=$1}1' OFS="\t" | sort -k1,1 -k2,2n > closest.1st.minus.GAT.to.GATA3.with.motif4.sorted.bed
# 3) 2nd closest GAT to peak summit (+/-)
awk '{print $4, $5, $6, $7, $8, $9}' closest.2nd.plus.GAT.to.1st.plus.GAT.GATA3.motif4.bed | awk '{$1=$1}1' OFS="\t" | sort -k1,1 -k2,2n > closest.2nd.plus.GAT.to.1st.plus.GAT.GATA3.motif4.sorted.bed
awk '{print $4, $5, $6, $7, $8, $9}' closest.2nd.plus.GAT.to.1st.minus.GAT.GATA3.motif4.bed | awk '{$1=$1}1' OFS="\t" | sort -k1,1 -k2,2n > closest.2nd.plus.GAT.to.1st.minus.GAT.GATA3.motif4.sorted.bed
awk '{print $4, $5, $6, $7, $8, $9}' closest.2nd.minus.GAT.to.1st.plus.GAT.GATA3.motif4.bed | awk '{$1=$1}1' OFS="\t" | sort -k1,1 -k2,2n > closest.2nd.minus.GAT.to.1st.plus.GAT.GATA3.motif4.sorted.bed
awk '{print $4, $5, $6, $7, $8, $9}' closest.2nd.minus.GAT.to.1st.minus.GAT.GATA3.motif4.bed | awk '{$1=$1}1' OFS="\t" | sort -k1,1 -k2,2n > closest.2nd.minus.GAT.to.1st.minus.GAT.GATA3.motif4.sorted.bed
Add trackline:
awk 'BEGIN {print "browser position chr10:16,000-17,000"
print "track type=bed name=\"GATA3.peak.with.motif4.summit.bed\" description=\"peak.motif4.summit_bed\" visibility=full autoScale=on useScore=1 color=0,0,0"
} {print $0}' GATA3_peak_summit_with_motif_4.sorted.bed > GATA3_peak_summit_with_motif_4.sorted.header.bed
awk 'BEGIN {print "browser position chr10:16,000-17,000"
print "track type=bed name=\"closest.1st.plus.GAT.motif4.bed\" description=\"plus.1st.GAT.m4\" visibility=full autoScale=on useScore=1 color=255,0,0"
} {print $0}' closest.1st.plus.GAT.to.GATA3.with.motif4.sorted.bed > closest.1st.plus.GAT.to.GATA3.with.motif4.sorted.header.bed
awk 'BEGIN {print "browser position chr10:16,000-17,000"
print "track type=bed name=\"closest.1st.minus.GAT.motif4.bed\" description=\"minus.1st.GAT.m4\" visibility=full autoScale=on useScore=1 color=0,0,255"
} {print $0}' closest.1st.minus.GAT.to.GATA3.with.motif4.sorted.bed > closest.1st.minus.GAT.to.GATA3.with.motif4.sorted.header.bed
awk 'BEGIN {print "browser position chr10:16,000-17,000"
print "track type=bed name=\"closest.2nd.plus.GAT.to.1st.plus.GAT.GATA3.motif4.bed\" description=\"plus.2nd.GAT.to.plus.1st.m4\" visibility=full autoScale=on useScore=1 color=0,0,0"
} {print $0}' closest.2nd.plus.GAT.to.1st.plus.GAT.GATA3.motif4.sorted.bed > closest.2nd.plus.GAT.to.1st.plus.GAT.GATA3.motif4.sorted.header.bed
awk 'BEGIN {print "browser position chr10:16,000-17,000"
print "track type=bed name=\"closest.2nd.plus.GAT.to.1st.minus.GAT.GATA3.motif4.bed\" description=\"plus.2nd.GAT.to.minus.1st.m4\" visibility=full autoScale=on useScore=1 color=0,0,0"
} {print $0}' closest.2nd.plus.GAT.to.1st.minus.GAT.GATA3.motif4.sorted.bed > closest.2nd.plus.GAT.to.1st.minus.GAT.GATA3.motif4.sorted.header.bed
awk 'BEGIN {print "browser position chr10:16,000-17,000"
print "track type=bed name=\"closest.2nd.minus.GAT.to.1st.plus.GAT.GATA3.motif4.bed\" description=\"minus.2nd.GAT.to.plus.1st.m4\" visibility=full autoScale=on useScore=1 color=0,0,0"
} {print $0}' closest.2nd.minus.GAT.to.1st.plus.GAT.GATA3.motif4.sorted.bed > closest.2nd.minus.GAT.to.1st.plus.GAT.GATA3.motif4.sorted.header.bed
awk 'BEGIN {print "browser position chr10:16,000-17,000"
print "track type=bed name=\"closest.2nd.minus.GAT.to.1st.minus.GAT.GATA3.motif4.bed\" description=\"minus.2nd.GAT.to.minus.1st.m4\" visibility=full autoScale=on useScore=1 color=0,0,0"
} {print $0}' closest.2nd.minus.GAT.to.1st.minus.GAT.GATA3.motif4.sorted.bed > closest.2nd.minus.GAT.to.1st.minus.GAT.GATA3.motif4.sorted.header.bed
Saved Section:
Figure 23: UCSC Genome browser image: 2nd closest GAT to 1st GAT relaive to GATA3 peaks with motif4
GATA3 peak with motif1
library(bigWig)
peak.region.summit=center.bed(read.table('/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/GAT_3mer_analaysis/MAST_positive_control/GATA3_peak_161win_with_motif_1.bed', sep = "\t", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
write.table(peak.region.summit,file= 'GATA3_peak_summit_with_motif_1.bed', quote=F,sep="\t",col.names=F,row.names=F)
head -5 GATA3_peak_summit_with_motif_1.bed
#chr10 100072672 100072673 GATA_ChIP_peak_10714
#chr10 100522507 100522508 GATA_ChIP_peak_10733
#chr10 101380202 101380203 GATA_ChIP_peak_10758b
#chr10 101453916 101453917 GATA_ChIP_peak_10760b
#chr10 101892965 101892966 GATA_ChIP_peak_10774
# 1) peak summit files
cat GATA3_peak_summit_with_motif_1.bed | sort -k1,1 -k2,2n > GATA3_peak_summit_with_motif_1.sorted.bed
# 2) closest GAT to peak summit (+/-)
awk '{print $4, $5, $6, $7, $8, $9}' closest.1st.plus.GAT.to.GATA3.with.motif1.bed | awk '{$1=$1}1' OFS="\t" | sort -k1,1 -k2,2n > closest.1st.plus.GAT.to.GATA3.with.motif1.sorted.bed
awk '{print $4, $5, $6, $7, $8, $9}' closest.1st.minus.GAT.to.GATA3.with.motif1.bed | awk '{$1=$1}1' OFS="\t" | sort -k1,1 -k2,2n > closest.1st.minus.GAT.to.GATA3.with.motif1.sorted.bed
# 3) 2nd closest GAT to peak summit (+/-)
awk '{print $4, $5, $6, $7, $8, $9}' closest.2nd.plus.GAT.to.1st.plus.GAT.GATA3.motif1.bed | awk '{$1=$1}1' OFS="\t" | sort -k1,1 -k2,2n > closest.2nd.plus.GAT.to.1st.plus.GAT.GATA3.motif1.sorted.bed
awk '{print $4, $5, $6, $7, $8, $9}' closest.2nd.plus.GAT.to.1st.minus.GAT.GATA3.motif1.bed | awk '{$1=$1}1' OFS="\t" | sort -k1,1 -k2,2n > closest.2nd.plus.GAT.to.1st.minus.GAT.GATA3.motif1.sorted.bed
awk '{print $4, $5, $6, $7, $8, $9}' closest.2nd.minus.GAT.to.1st.plus.GAT.GATA3.motif1.bed | awk '{$1=$1}1' OFS="\t" | sort -k1,1 -k2,2n > closest.2nd.minus.GAT.to.1st.plus.GAT.GATA3.motif1.sorted.bed
awk '{print $4, $5, $6, $7, $8, $9}' closest.2nd.minus.GAT.to.1st.minus.GAT.GATA3.motif1.bed | awk '{$1=$1}1' OFS="\t" | sort -k1,1 -k2,2n > closest.2nd.minus.GAT.to.1st.minus.GAT.GATA3.motif1.sorted.bed
Add trackline:
awk 'BEGIN {print "browser position chr10:16,000-17,000"
print "track type=bed name=\"GATA3.peak.with.motif1.summit.bed\" description=\"peak.motif1.summit_bed\" visibility=full autoScale=on useScore=1 color=0,0,0"
} {print $0}' GATA3_peak_summit_with_motif_1.sorted.bed > GATA3_peak_summit_with_motif_1.sorted.header.bed
awk 'BEGIN {print "browser position chr10:16,000-17,000"
print "track type=bed name=\"closest.1st.plus.GAT.motif1.bed\" description=\"plus.1st.GAT.m1\" visibility=full autoScale=on useScore=1 color=255,0,0"
} {print $0}' closest.1st.plus.GAT.to.GATA3.with.motif1.sorted.bed > closest.1st.plus.GAT.to.GATA3.with.motif1.sorted.header.bed
awk 'BEGIN {print "browser position chr10:16,000-17,000"
print "track type=bed name=\"closest.1st.minus.GAT.motif1.bed\" description=\"minus.1st.GAT.m1\" visibility=full autoScale=on useScore=1 color=0,0,255"
} {print $0}' closest.1st.minus.GAT.to.GATA3.with.motif1.sorted.bed > closest.1st.minus.GAT.to.GATA3.with.motif1.sorted.header.bed
awk 'BEGIN {print "browser position chr10:16,000-17,000"
print "track type=bed name=\"closest.2nd.plus.GAT.to.1st.plus.GAT.GATA3.motif1.bed\" description=\"plus.2nd.GAT.to.plus.1st.m1\" visibility=full autoScale=on useScore=1 color=0,0,0"
} {print $0}' closest.2nd.plus.GAT.to.1st.plus.GAT.GATA3.motif1.sorted.bed > closest.2nd.plus.GAT.to.1st.plus.GAT.GATA3.motif1.sorted.header.bed
awk 'BEGIN {print "browser position chr10:16,000-17,000"
print "track type=bed name=\"closest.2nd.plus.GAT.to.1st.minus.GAT.GATA3.motif1.bed\" description=\"plus.2nd.GAT.to.minus.1st.m1\" visibility=full autoScale=on useScore=1 color=0,0,0"
} {print $0}' closest.2nd.plus.GAT.to.1st.minus.GAT.GATA3.motif1.sorted.bed > closest.2nd.plus.GAT.to.1st.minus.GAT.GATA3.motif1.sorted.header.bed
awk 'BEGIN {print "browser position chr10:16,000-17,000"
print "track type=bed name=\"closest.2nd.minus.GAT.to.1st.plus.GAT.GATA3.motif1.bed\" description=\"minus.2nd.GAT.to.plus.1st.m1\" visibility=full autoScale=on useScore=1 color=0,0,0"
} {print $0}' closest.2nd.minus.GAT.to.1st.plus.GAT.GATA3.motif1.sorted.bed > closest.2nd.minus.GAT.to.1st.plus.GAT.GATA3.motif1.sorted.header.bed
awk 'BEGIN {print "browser position chr10:16,000-17,000"
print "track type=bed name=\"closest.2nd.minus.GAT.to.1st.minus.GAT.GATA3.motif1.bed\" description=\"minus.2nd.GAT.to.minus.1st.m1\" visibility=full autoScale=on useScore=1 color=0,0,0"
} {print $0}' closest.2nd.minus.GAT.to.1st.minus.GAT.GATA3.motif1.sorted.bed > closest.2nd.minus.GAT.to.1st.minus.GAT.GATA3.motif1.sorted.header.bed
Saved Section:
Figure 24: UCSC Genome browser image: 2nd closest GAT to 1st GAT relaive to GATA3 peaks with motif1
Is IgG read depth comparable to GATA3 read depth? –YES \#cd GATA3_ChIP_PRO_July2023/ChIP_final/sorted.bam_final/read_depth/
#module load R/4.1.2
#R
library("lattice")
df_GATA=read.csv('GATA_sorted_bam_reads.csv')
df_IgG=read.csv('IgG_sorted_bam_reads.csv')
df_sum=rbind(df_GATA[1:3,], df_IgG)
df_sum$supp=c(rep("GATA",3), rep("IgG", 4))
df_sum$rep=c(rep(c(1,2,3),2 ),4)
df_sum$supp2=as.factor(paste(df_sum$treatment,df_sum$rep, sep = "_"))
pdf('barplot_sorted_bam_GATA_IgG_reads1.pdf',width=6,height=8)
print(barchart(aligned_reads~ supp,
group=treatment,
stack=T,
data = df_sum,
#auto.key=list(space="right"),
scales = list(x = list(rot = 45)),
xlab = "libraries",
ylab = "post-alignment reads (sorted_bam)",
))
dev.off()
pdf('barplot_sorted_bam_GATA_IgG_reads.pdf',width=10,height=8)
print(barchart(aligned_reads~ supp2|supp,
group=treatment,
stack=T,
data = df_sum,
#auto.key=list(space="right"),
scales = list(x = list(rot = 45)),
xlab = "libraries",
ylab = "post-alignment reads (sorted_bam)",
))
dev.off()
Figure 25: stacked bar plot of GATA3 libraries and IgG libraries
The read depth in most libraries between GATA3 and IgG is comparable. However, IgG CC_rep2 exhibits higher read depth due to its sequencing in the second pool, which contains fewer samples.
Saturation curve (reads vs. number of peaks) \#! /bin/sh
#SBATCH --job-name=saturation_curve.sh
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 16
#SBATCH -p general
#SBATCH --qos=general
#SBATCH --mem=128G
#SBATCH --mail-type=ALL
#SBATCH --mail-user=ssun@uchc.edu
#SBATCH -o saturation_curve.sh_%j.out
#SBATCH -e saturation_curve.sh_%j.err
hostname
mkdir temp_macs
module load macs3
module load samtools/1.12
directory=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/sorted.bam_final/
# Define subsampling percentages
subsample_percents=$(seq 10 10 100)
for subsample_percent in $subsample_percents
do
subsample_val=$(echo "scale=2; $subsample_percent / 100" | bc)
# Sub-sample treatment BAM files
for treatment_file in ${directory}*_GATA_CC*sorted.bam
do
nm=$(echo $treatment_file | awk -F"/" '{print $NF}' | awk -F"MCF7_dTAGGATA522_" '{print $2}')
samtools view -s $subsample_val -b $treatment_file > subsampled_${subsample_val}_${nm}
done
# Sub-sample control BAM files
for control_file in ${directory}*IgG*sorted.bam
do
name=$(echo $control_file | awk -F"/" '{print $NF}' | awk -F"MCF7_dTAGGATA522_" '{print $2}')
samtools view -s $subsample_val -b $control_file > subsampled_${subsample_val}_${name}
done
# Call peaks using MACS3 with all subsampled treatment and control BAM files
treatment_files=$(ls subsampled_${subsample_val}_GATA*.bam | tr '\n' ' ')
control_files=$(ls subsampled_${subsample_val}_IgG*.bam | tr '\n' ' ')
macs3 callpeak --call-summits -t $treatment_files -c $control_files -n GATA_ChIP_subsample_${subsample_val} -g hs -q 0.01 --keep-dup all -f BAMPE --nomodel --tempdir temp_macs
# Clean up intermediate subsampled files if needed
rm subsampled_${subsample_val}_*.bam
done
#module load R/4.1.2
#R
count_peaks <- function(file_path) {
peak_count <- system(paste("wc -l", file_path), intern = TRUE)
return(as.numeric(strsplit(peak_count, " ")[[1]][1]))
}
# Find all summit.bed files
file_paths <- Sys.glob("/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/Saturation_curve/*summits*.bed")
# Initialize an empty data frame
data <- data.frame(Read_Depth = numeric(0), Peak_Number = numeric(0))
# Read data and count peaks for each file
for (i in seq_along(file_paths)) {
peak_number <- count_peaks(file_paths[i])
data <- rbind(data, data.frame(Read_Depth = c(100, seq(10, 90, 10))[i], Peak_Number = peak_number))
}
# Plot the saturation curve with lattice
library(lattice)
pdf('240103_saturation_curve.pdf',width=8,height=5)
print(xyplot(Peak_Number ~ Read_Depth, data = data, type = c("p", "smooth"), col = "blue"))
dev.off()
# Fit asymptotic regression
asymptotic_model <- nls(Peak_Number ~ asymptotic_formula(Read_Depth, a, b, c),
data = data,
start = list(a = mean(data$Peak_Number), b = 0.1, c = 10))
# Define the asymptotic formula (replace this with the formula you want to fit)
asymptotic_formula <- function(x, a, b, c) {
a * (1 - exp(-b * x)) + c
}
# Plot the saturation curve with lattice and add fitted curve
pdf('240103_saturation_curve2.pdf',width=8,height=5)
print(xyplot(Peak_Number ~ Read_Depth,
data = data,
type = c("p", "smooth"),
col = "blue",
xlim=c(0, 300),
ylim=c(0, 150000),
xlab="percent read depth (%)",
panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
panel.curve(asymptotic_formula(x, coef(asymptotic_model)["a"], coef(asymptotic_model)["b"], coef(asymptotic_model)["c"]), col = "red", add = TRUE)
})
)
dev.off()
png('240103_saturation_curve2.png')
print(xyplot(Peak_Number ~ Read_Depth,
data = data,
type = c("p", "smooth"),
col = "blue",
xlim=c(0, 300),
ylim=c(0, 150000),
xlab="percent read depth (%)",
panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
panel.curve(asymptotic_formula(x, coef(asymptotic_model)["a"], coef(asymptotic_model)["b"], coef(asymptotic_model)["c"]), col = "red", add = TRUE)
})
)
dev.off()
Figure 26: saturation curve for GATA3 ChIP-seq peaks
Peak sensitivity and specificity using GATA3-depleted cells and untreated cells. Do sensitive peak sets have a sequence-binding element? \–>
Homology of two GATA3 zinc fingers -N-zinc sequence, C-zinc sequences, align them. \make sure the full structure of zinc finger; and the specificity of each zinc finger binding to the specific sequences.
Sequence Retrieval from Uniprot P23771 · GATA3_HUMAN
zinc finger 1: 263-287: CVNCGATSTPLWRRDGTGHYLCNAC[GLY] ; zinc finger 2: 317-341: CANCQTTTTTLWRRNANGDPVCNAC[GLY]
NCBI BLAST Alignment:
Figure 27: NCBI blast for the two zinc finger
The two zinc finger of GATA3 have 60% identity.
What are the characteristics of the two activation domains? \ What are the characteristics of GATA3 zinc fingers? \RSAT (Regulatory Sequence Analysis Tools)- dyad analysis:
RSAT provides a collection of tools for the analysis of regulatory sequences. It includes tools like dyad-analysis, which allows for the detection of transcription factor binding sites with spaced pairs of k-mers.
http://rsat.sb-roscoff.fr/dyad-analysis_form.cgi
module load bedtools
genome=/home/FCAM/ssun/Genome/hg38.fa
fastaFromBed -fi $genome -bed with_motifs_1_101bp_mast.bed -fo with_motifs_1_101bp_mast.fasta
refer to: https://rsa-tools.github.io/installing-RSAT/conda-install-rsat/bioconda-rsat-core.html
cd /home/FCAM/ssun/packages/
source ~/miniconda3/bin/activate
conda create -n rsat
conda env list | grep rsat
#rsat /home/FCAM/ssun/miniconda3/envs/rsat
conda activate rsat
conda list # see what package listed in the env
conda install -c bioconda rsat-core
#testing
which rsat
#~/miniconda3/envs/rsat/bin/rsat
## Statistics::Distributions exists in conda but for some reason it is not working
cpan install Statistics::Distributions
## Email::Sender::Simple is required for the Web server + some particular scripts
cpan install Email::Sender::Simple
## Required for matrix-clustering
cpan install Parallel::ForkManager
cpan install MCE::Shared
## For SOAP::WSDL Web services
cpan install SOAP::WSDL
cpan install XML::Compile::SOAP11
cpan install XML::Compile::WSDL11
cpan install XML::Compile::Transport::SOAPHTTP
rsat --list
rsat --list | wc -l
rsat --help
reading the help menu
cd /home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/GAT_3mer_analaysis/RSAT/
source ~/miniconda3/bin/activate
conda activate rsat
#rsat --help
rsat oligo-analysis -h
rsat create-background-model -h
rsat dyad-analysis -h
Generating the expected frequency table
The RSAT-dyad analysis provide two ways to control for background. One is using the -bg option: Either using the predefined background frequencies of the selected organism (-org)/taxon(-taxon), or estimating expected dyad frequencies from input sequences (-bg monads (default)/-bg input).
The other one allows users to upload their own expected frequency file with -expfreq.
On the websites, they have a detailed instruction of “How to generate an expected frequency file?”. I will follow their rules to generate the expected frequency table use the DHS regions.
#A sample background file on the website:
head -50 3nt_gene_Saccharomyces_cerevisiae-ovlp-1str.freq
; oligo-analysis -v 1 -1str -i /bio/rsa-tools/data/genomes/Saccharomyces_cerevisiae/genome/Saccharomyces_cerevisiae_gene_segments.wc -format wc -1str -ovlp -l 3 -type dna -return freq,occ -o /bio/rsa-tools/data/genomes/Saccharomyces_cerevisiae/oligo-frequencies/3nt_gene_Saccharomyces_cerevisiae-ovlp-1str.freq
; Citation: van Helden et al. (1998). J Mol Biol 281(5), 827-42.
; Oligomer length 3
; Input file /bio/rsa-tools/data/genomes/Saccharomyces_cerevisiae/genome/Saccharomyces_cerevisiae_gene_segments.wc
; Input format wc
; Output file /bio/rsa-tools/data/genomes/Saccharomyces_cerevisiae/oligo-frequencies/3nt_gene_Saccharomyces_cerevisiae-ovlp-1str.freq
; Count overlapping matches
; Counted on a single strand
; Method
; Sequence type DNA
; Nb of sequences 6080
; Sum of sequence lengths 8809751
; discarded residues 0 (other letters than ACGT)
; discarded occurrences 0 (contain discarded residues)
; nb possible positions 8797591
; total oligo occurrences 8797591
; alphabet size 4
; nb possible oligomers 64
; oligomers tested for significance 0
; Threshold values
; Parameter Lower Upper
; occ_P none 1
;
; column headers
; 1 seq oligomer sequence
; 2 id oligomer identifier
; 3 observed_freq observed relative frequency
; 4 occ observed occurrences
; 5 test over- or under-representation test
;seq id observed_freq occ test
aaa aaa 0.0350766476868 308590
aac aac 0.0184259531956 162104
aag aag 0.0218185864744 191951
aat aat 0.0282970645032 248946
aca aca 0.0168762107718 148470
acc acc 0.0126323217344 111134
acg acg 0.0088157087548 77557
act act 0.0149614820694 131625
aga aga 0.0207571595452 182613
agc agc 0.0118527901559 104276
agg agg 0.0118003894475 103815
agt agt 0.0150120640980 132070
ata ata 0.0206869130424 181995
atc atc 0.0186547658331 164117
atg atg 0.0186416940728 164002
att att 0.0282246583184 248309
caa caa 0.0245105734058 215634
cac cac 0.0108630874065 95569
cag cag 0.0135514369786 119220
cat cat 0.0187047795243 164557
create-background-model: Create a background model by computing oligonucleotide frequencies in a user-specified sequence file (he “background” file).
#rsat convert-background-model -h
#rsat create-background-model -h
#create-background-model [-i inputfile] [-o outputfile] [-v #] [...]
module load bedtools
genome=/home/FCAM/ssun/Genome/hg38.fa
dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/GAT_3mer_analaysis/ENCODE_DHS_GSE29692/
fastaFromBed -fi $genome -bed ${dir}MCF7DHS_consensus_noGATA_without_motifs_123456_78.bed -fo MCF7DHS_consensus_noGATA_without_motifs_123456_78.fasta
rsat create-background-model -i MCF7DHS_consensus_noGATA_without_motifs_123456_78.fasta -o ENCODE.MCF7.DHS.background.txt -out_format oligos -markov 2
#An oligo file containing k-mers frequencies corresponds to a Markov model of order m=k-1.
rsat oligo-analysis -i MCF7DHS_consensus_noGATA_without_motifs_123456_78.fasta -seqtype dna -l 3 -1str -return freq -lth occ_sig none -v 1 -format fasta -sort -o ENCODE.MCF7.DHS.background2.txt
#-noov do not allow overlapping matches of the same word
#-ovlp Count overlapping occurrences of self-overlpping words.
#-1str Strand-sensitive background model
# -1str inactivates the summation of occurrences on both strands. Count on single strand
head -30 ENCODE.MCF7.DHS.background2.txt
; oligo-analysis -i MCF7DHS_consensus_noGATA_without_motifs_123456_78.fasta -seqtype dna -l 3 -1str -return freq -lth occ_sig none -v 1 -format fasta -sort -o ENCODE.MCF7.DHS.background2.txt
; Citation: van Helden et al. (1998). J Mol Biol 281(5), 827-42.
; Program version 1.169
; Slow counting mode
; Oligomer length 3
; Input file MCF7DHS_consensus_noGATA_without_motifs_123456_78.fasta
; Input format fasta
; Output file ENCODE.MCF7.DHS.background2.txt
; Count overlapping matches
; Counted on a single strand
; Background model Bernoulli
; Background estimation method
; Sequence type DNA
; Nb of sequences 57906
; Sum of sequence lengths 7677291
; discarded residues (other letters than ACGT)
; discarded occurrences (contain discarded residues)
; nb possible positions 7561762
; total oligo occurrences 7561762
; alphabet size 4
; nb possible oligomers 64
; oligomers tested for significance 0
;
; column headers
; 1 seq oligomer sequence
; 2 id oligomer identifier
; 3 obs_freq observed relative frequency
#seq id obs_freq
ctg ctg 0.0290296097656
cag cag 0.0289211694311
The 3mer oligo frequencies stored in ENCODE.MCF7.DHS.background2.txt generated by rsat oligo-analysis is the same as the 3mer oligo frequencies stored in file generated by rsat create-background-model.
awk '{print $1, $3}' ENCODE.MCF7.DHS.background.txt | awk '{$1=$1}1' OFS="\t" > ENCODE.MCF7.DHS.background3.txt
Estimation of expected dyad frequencies
rsat dyad-analysis -v 1 -i MCF7DHS_consensus_noGATA_without_motifs_123456_78.fasta -format fasta -type any -seqtype dna -1str -ovlp -sp 0-20 -l 3 -return freq,occ -o ENCODE.MCF7.DHS.background4.txt
#-ovlp Count overlapping occurrences of self-overlpping words.
Perform RSAT-dyad analysis to all five positive controls
Prepare the positive control files in fasta format.
module load bedtools
genome=/home/FCAM/ssun/Genome/hg38.fa
dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/GAT_3mer_analaysis/MAST_positive_control/
fastaFromBed -fi $genome -bed ${dir}GATA3_peak_161win_with_motif_1.bed -fo GATA3_peak_161win_with_motif_1.fasta
fastaFromBed -fi $genome -bed ${dir}GATA3_peak_161win_with_motif_2.bed -fo GATA3_peak_161win_with_motif_2.fasta
fastaFromBed -fi $genome -bed ${dir}GATA3_peak_161win_with_motif_4.bed -fo GATA3_peak_161win_with_motif_4.fasta
fastaFromBed -fi $genome -bed ${dir}GATA3_peak_161win_with_motif_5.bed -fo GATA3_peak_161win_with_motif_5.fasta
fastaFromBed -fi $genome -bed ${dir}GATA3_peak_161win_with_motif_6.bed -fo GATA3_peak_161win_with_motif_6.fasta
rsat dyad-analysis -o GATA3_peak_161win_with_motif_1_RSAT_dyad.txt -i GATA3_peak_161win_with_motif_1.fasta -format FastA -l 3 -sp 0-20 -expfreq ENCODE.MCF7.DHS.background4.txt -return exp_occ,occ,ratio -sort -seqtype dna
#-noov do not allow overlapping matches of the same word
#-zeroocc
#-two_tails
# -groupsp Group dyads made of the same words (monads) but with different spacings.
# -2str count on oth strands; The occurrences of each oligonucleotide are summed on both strands. This allows to detect elements which act in an orientation-insensitive way (as is generally the case for yeast upstream elements).
# -1str single strand count; only the direct strand is considered for oligonucleotide and dyad occurrence counting.
# -type dyad_type (dr|ir|any) any (default)
wc -l GATA3_peak_161win_with_motif_1_RSAT_dyad.txt #43681 GATA3_peak_161win_with_motif_1_RSAT_dyad.txt
head GATA3_peak_161win_with_motif_1_RSAT_dyad.txt
#sequence identifier expected_freq occ exp_occ ovl_occ all_occ ratio
#gatn{3}atc gatn{3}atc|gatn{3}atc 0.0000716460535 12482 85.98 36 12518 145.17
#agan{4}atc agan{4}atc|gatn{4}tct 0.0003298219842 10904 390.07 4445 15349 27.95
#atan{2}atc atan{2}atc|gatn{2}tat 0.0001616794646 10453 196.05 1460 11913 53.32
#gatn{4}tca gatn{4}tca|tgan{4}atc 0.0003742452755 7872 442.61 1728 9600 17.79
#agan{3}tat agan{3}tat|atan{3}tct 0.0002979478048 7425 357.56 613 8038 20.77
#gatn{1}tta gatn{1}tta|taan{1}atc 0.0002442959088 6255 300.53 50 6305 20.81
#agan{5}tca agan{5}tca|tgan{5}tct 0.0006445764392 5986 756.26 43 6029 7.92
#attn{2}atc attn{2}atc|gatn{2}aat 0.0002487059113 5956 301.58 690 6646 19.75
#atgn{2}atc atgn{2}atc|gatn{2}cat 0.0003617165575 5268 438.62 51 5319 12.01
Other positive controls:
rsat dyad-analysis -o GATA3_peak_161win_with_motif_2_RSAT_dyad.txt -i GATA3_peak_161win_with_motif_2.fasta -format FastA -l 3 -sp 0-20 -expfreq ENCODE.MCF7.DHS.background4.txt -return exp_occ,occ,ratio -sort -seqtype dna
rsat dyad-analysis -o GATA3_peak_161win_with_motif_4_RSAT_dyad.txt -i GATA3_peak_161win_with_motif_4.fasta -format FastA -l 3 -sp 0-20 -expfreq ENCODE.MCF7.DHS.background4.txt -return exp_occ,occ,ratio -sort -seqtype dna
rsat dyad-analysis -o GATA3_peak_161win_with_motif_5_RSAT_dyad.txt -i GATA3_peak_161win_with_motif_5.fasta -format FastA -l 3 -sp 0-20 -expfreq ENCODE.MCF7.DHS.background4.txt -return exp_occ,occ,ratio -sort -seqtype dna
rsat dyad-analysis -o GATA3_peak_161win_with_motif_6_RSAT_dyad.txt -i GATA3_peak_161win_with_motif_6.fasta -format FastA -l 3 -sp 0-20 -expfreq ENCODE.MCF7.DHS.background4.txt -return exp_occ,occ,ratio -sort -seqtype dna
Rank the output patterns
The last column ratio is the observed/expected occurrence ratio. This ratio can be used as a rough indicator of over-representation, but it might overestimate the patterns with a very low number of expected occurrences.
I will rank the patterns by ratio, and plot the top 15 patterns to a bar plot.
glk-SVM
Suitable for finding co-binding (individual binding elements with several spacing to another individual binding elements) (happens when protein pairs are found; CAP-SELEX)
https://github.com/kevingroup/glk-SVM
https://academic.oup.com/bib/article/21/5/1787/5612164
Given purified zinc fingers and a DNA template, can we observe different spacing/orientations of binding patterns? (cryoEM or in silicon alphafold to find protein-DNA binding) \